Example 9b
Rewriting lowercase()
(again!)

With reference to "Don't be over-zealous", I thought I'd try a few more optimisms.
Why?
Because.

Actually, it is quite a useful exercise as it demonstrates that while most code can be optimised, there are sometimes good reasons for not doing such a thing.

IMPORTANT! This example is designed to be compiled/assembled with the 'new' 26/32 bit compiler.
For the record, this is:

  cc     v5.51
  objasm v3.27
  link   v5.27

or, obviously, a later version.
For those using the older Norcroft C versions 4 or 5, I'll provide pointers to things to change.

Introduction

Following the introduction of the original lowercase() routine (read it here), Nick Roberts (of ASM fame) contacted me to say that I was not comparing the same thing in C and assembler; especially given that the C version has all of the overheads of a function call (with overheads) for each character to be converted.
This is a fair enough assessment, the tolower() function is exactly that - a function.

Hackery

The first step in this exercise was to work out exactly what the C library is doing. We will disassemble the supplied version of ansilib as I don't fancy poking inside SharedCLib.

The first step was to use !LibFile to strip ansilib into its components. That done, we toss the ctype sub-library to !DecAOF. It'll output loads of garbage.

In the symbol table we'll see:

tolower       : Global, Leaf, Relative, offset 0x00e0 in area "C$$code"

So, let's look at the dump for +&00E0:

  0x0000e0:  e51f100c  .... : LDR      r1,0xdc
  0x0000e4:  e5911004  .... : LDR      r1,[r1,#4]
  0x0000e8:  e3510000  ..Q. : CMP      r1,#0
  0x0000ec:  17d10000  .... : LDRNEB   r0,[r1,r0]
  0x0000f0:  11a0f00e  .... : MOVNE    pc,r14
  0x0000f4:  e51f10ec  .... : LDR      r1,0x10
  0x0000f8:  e7d11000  .... : LDRB     r1,[r1,r0]
  0x0000fc:  e3110010  .... : TST      r1,#0x10
  0x000100:  12800020   ... : ADDNE    r0,r0,#0x20
  0x000104:  e1a0f00e  .... : MOV      pc,r14

There it is. In essence, it isn't much different to our routine. But, you might ask, what is it referring to at &00DC?
For this, we look elsewhere in the dump for "STR" to that address, and we find it in the ctype initialisation routine, which contains:

  0x000360:  e1a00004  .... : MOV      r0,r4
  0x000364:  e3a01058  X... : MOV      r1,#0x58
  0x000368:  e2811a43  C... : ADD      r1,r1,#0x43000
  0x00036c:  ebffff80  .... : BL       getcvttable
  0x000370:  e51f529c  .R.. : LDR      r5,0xdc
  0x000374:  e5850000  .... : STR      r0,[r5,#0]
  0x000378:  e1a00004  .... : MOV      r0,r4
  0x00037c:  e3a01057  W... : MOV      r1,#0x57
  0x000380:  e2811a43  C... : ADD      r1,r1,#0x43000
  0x000384:  ebffff7a  z... : BL       getcvttable
  0x000388:  e5a50004  .... : STR      r0,[r5,#4]!

The first SWI built is &43058 ("Territory_UpperCaseTable"), the second is &43057 ("Territory_LowerCaseTable").

Now let's isolate the c_lowercase function. The bit that does the actual work is in bold, the rest sets up the APCS stack frame and does the standard stack space check.

c_lowercase
        MOV      ip,sp
        STMDB    sp!,{a1,v1,v2,fp,ip,lr,pc}
        SUB      fp,ip,#4
        CMP      sp,sl
        BLLT     __rt_stkovf_split_small
        MOV      v1,a1
        MOV      v2,#0
        LDRB     a1,[a1,#0]
        CMP      a1,#0
        LDMEQDB  fp,{v1,v2,fp,sp,pc}
|L0002cc.J4.c_lowercase|
        LDRB     a1,[v1,v2]
        BL       tolower
        STRB     a1,[v1,v2]
        ADD      v2,v2,#1
        LDRB     a1,[v1,v2]
        CMP      a1,#0
        BNE      |L0002cc.J4.c_lowercase|
        LDMDB    fp,{v1,v2,fp,sp,pc}

Tick-tock!

Now we have our finger on the problem. The C compiler is not doing that much different to what we were doing, however wading through all this rubbish we can see that Nick is not quite correct when he says:

"in your C example, each character invokes the penalty of an APCS-compliant procedure call, with all that implies as far as stack management is concerned."

I say this because the APCS setup (above) is essentially external to our core - which is the part marked in bold.
Deep inside, I might have wished for toupper() to have an 'expensive' APCS setup, as this might explain the disparate timings.

Instead, we have:

    [...our program...]
    |L0002cc.J4.c_lowercase|
.-> LDRB     a1,[v1,v2]
|   BL       tolower     --->  [...CLib jump-table...]
|                                   |
|                                      |
|                                   '----> [ ...in ansilib, probably the
|                                          same/similar in SharedCLib...]
|                                        LDR      r1,0xdc
|                                        LDR      r1,[r1,#4]
|                                        CMP      r1,#0
|                                        LDRNEB   r0,[r1,r0]
|                                        MOVNE    pc,r14
|                                        LDR      r1,0x10
|                                        LDRB     r1,[r1,r0]
|                                        TST      r1,#0x10
|                                        ADDNE    r0,r0,#0x20
|                                        MOV      pc,r14
|     .----------------------------------'
|    \|/
|   [...in our program again...]
|   STRB     a1,[v1,v2]
|   ADD      v2,v2,#1
|   LDRB     a1,[v1,v2]
|   CMP      a1,#0
'-- BNE      |L0002cc.J4.c_lowercase|

Before we go any further, I will reveal to you my timings. The exact differences will be discussed later. For now, we want to compare the C functional with the assembler.

IMPORTANT!
For all intents and purposes, the actual timings shown below are irrelevant. They were performed in a TaskWindow on an 32+1Mb RiscPC with and ARM710 and the nested WIMP. The actual results you get will vary wildly dependant upon your machine, your podules (both Econet and Ethernet can do 'background' stuff; so too can some types of SCSI card), the software that is running, etc, etc, etc...
What we are to concentrate on is the difference between the numbers. How much faster is X than Y, as a percentage? That sort of thing.

So, then, here are the results of the international jury...

Testing C functional method...
Time taken for 1000 iterations was 141 centiseconds.

Testing C non-functional method...
Time taken for 1000 iterations was 147 centiseconds.

Testing assembler method...
Time taken for 1000 iterations was 50 centiseconds.

Testing faster assembler method...
Time taken for 1000 iterations was 45 centiseconds.

The assembler method runs in 50 centiseconds. The C function runs in 141 centiseconds. What could be causing this? I added twenty NOP instructions to the inner loop of the assembler function and it went up to 108 centiseconds ... so even that was quicker!

So I then tried to split the assembler routine up. It loads a byte and branches to the code to do the tolower conversion, which then branches back. That took 65 centiseconds.
Wait! The true SharedCLibrary has another branch - we branch (with link) to the jump-table which does an absolute branch to the real code. So I mimicked this, and the timing came to 76cs. I think if I pushed a bunch of NOPs back in, I might get it to touch 100. So, in total, my routine appears to run that much faster than the standard library routines, and quite frankly I am at a loss to explain why the library is that much slower.

Now, you must expect about 2cs inaccuracy since I'm using the TaskWindow. However I repeat my tests three times and then I average the results. So, for the record, a test when linked with the stand-alone ansilib gives us:

Testing C functional method...
Time taken for 1000 iterations was 109 centiseconds.

Testing C non-functional method...
Time taken for 1000 iterations was 123 centiseconds.

Testing assembler method...
Time taken for 1000 iterations was 49 centiseconds.

Testing faster assembler method...
Time taken for 1000 iterations was 45 centiseconds.

Suddenly our code seems that much less impressive. In fact if our function takes 50cs normally, 108cs with twenty NOPs, and 15cs with a branch; then the time taken for NOPs and branch should be ( (108 - 50) + (65 - 50) + 50 ) which comes to 123cs. Given that 20 NOPs is a rather inaccurate method of adding extra time to the function, I think it is safe to draw a rough assumption that our code and the ansilib code would be pretty evenly matched - you know, if I did it that way... :-)
That still doesn't explain why the SharedCLibrary is that much slower. It is interesting that the stand-alone library code is about a quarter faster than the SharedCLibrary module!

The faster version

One thing we're always told is damned slow is the SWI call. There's a whole heap of code within the SWI vector (refer to Frobnicate issue twelve-and-a-half, page 14 - Nava Whiteford explains it better than I could!) so it is useful to minimise the overheads of SWI calling.
In the original lowercase(), we call the SWI to work out the current Territory, and the SWI to return the LowerCase lookup table (including the code for those SWIs themselves) once for each call, or a thousand times in all. The obvious optimisation is to perform this once at the beginning stash the lookup table address someplace. Then in the actual function we simply read this address and work from there.
By doing this, we shave off a mere five centiseconds.

One final idea

So if the overheads of an APCS call is thought to be a bad thing, then we can simply test this by taking the exact same code and in-lining it (i.e. no function call!).
Ironically, perhaps even disturbingly, it is slower! Here is the wrapper code that calls the C lowercase() function:

|L000064.J4.main|
        ADD      a1,sp,#&28
        BL       c_lowercase
        ADD      v1,v1,#1
        CMP      v1,#&3e8
        BLT      |L000064.J4.main|

This calls the c_lowercase() function (described above).

Here's the in-line version:

|L0000b8.J6.main|
        MOV      v1,#0
        LDRB     a1,[sp,#&28]
        CMP      a1,#0
        BEQ      |L0000f0.J8.main|
|L0000c8.J7.main|
        ADD      a1,sp,#&28
        LDRB     a1,[a1,v1]
        BL       tolower
        ADD      a2,sp,#&28
        STRB     a1,[a2,v1]
        ADD      v1,v1,#1
        ADD      a1,sp,#&28
        LDRB     a1,[a1,v1]
        CMP      a1,#0
        BNE      |L0000c8.J7.main|
|L0000f0.J8.main|
        ADD      v2,v2,#1
        CMP      v2,#&3e8
        BLT      |L0000b8.J6.main|

The difference? The &28 offsets from sp. These, in total, no doubt execute faster than the APCS stack frame setup in the function version; so surely the function should be a tad slower, if anything? The rest looks pretty similar, so...

In this case, an attempt to speed things up has failed. Miserably. Go figure!

PS: Changing the int i = 0; to register int i = 0; does not affect the timings.

The code

The C test program (c.runit):

In order to get this software to work on older machines, you will need to alter the code marked in navy blue.


/* c.runit

   Example of lowercase conversion routines in C and assembler.

   by Rick Murray, 2003/12/23

   For my Assembler tutorial at http://www.heyrick.co.uk/assembler/


   THIS CODE IS INTENDED FOR THE 26/32 COMPILER.
   IT USES A FEW ISO 9899:1990 THINGS, SO WILL NOT WORK ON EARLIER COMPILERS.

   THIS CODE (AND THE ASSEMBLER PARTS) *SHOULD* BE 32 BIT CLEAN.
*/

#include <stdio.h>
#include <ctype.h>
#include "kernel.h"
#include "swis.h"



void c_lowercase(char []);
extern void lowercase(char *);
extern void faster_lowercase_init(void);
extern void faster_lowercase(char *);



int main(void)
{
   // note that this is not strictly identical to the earlier code - each '¤'
   // character seems to be prefixed with 'Â'. Is this Fresco being weird?
   // I've left it in (basically because I only noticed it when marking up
   // this program for the HTML!)

   char s[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890~`!@#$%^&*()_-+={[}]|\\:" \
              ";\"\'<,>.?/ABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890~`!@#$%^&*()_" \
              "-+={[}]|\\:;\"\'<,>.?/ABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890~`" \
              "!@#$%^&*()_-+={[}]|\\:;\"\'<,>.?/ABCDEFGHIJKLMNOPQRSTUVWXYZ1" \
              "234567890~`!@#$%^&*()_-+Â¤ABCDEFGHIJKLMNOPQRSTUVWXYZ12345678" \
              "90~`!@#$%^&*()_-+={[}]|\\:;\"\'<,>.?/ABCDEFGHIJKLMNOPQRSTUVW" \
              "XYZ1234567890~`!@#$%^&*()_-+={[}]|\\:;\"\'<,>.?/ABCDEFGHIJKL" \
              "MNOPQRSTUVWXYZ1234567890~`!@#$%^&*()_-+={[}]|\\:;\"\'<,>.?/A" \
              "BCDEFGHIJKLMNOPQRSTUVWXYZ1234567890~`!@#$%^&*()_-+Â¤ABCDEFGH" \
              "IJKLMNOPQRSTUVWXYZ1234567890~`!@#$%^&*()_-+={[}]|\\:;\"\'<,>" \
              ".?/ABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890~`!@#$%^&*()_-+={[}]|" \
              "\\:;\"\'<,>.?/ABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890~`!@#$%^&*" \
              "()_-+={[}]|\\:;\"\'<,>.?/ABCDEFGHIJKLMNOPQRSTUVWXYZ123456789" \
              "0~`!@#$%^&*()_-+Â¤ABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890~`!@#$" \
              "%^&*()_-+={[}]|\\:;\"\'<,>.?/ABCDEFGHIJKLMNOPQRSTUVWXYZ12345" \
              "67890~`!@#$%^&*()_-+={[}]|\\:;\"\'<,>.?/ABCDEFGHIJKLMNOPQRST" \
              "UVWXYZ1234567890~`!@#$%^&*()_-+={[}]|\\:;\"\'<,>.?/ABCDEFGHI" \
              "JKLMNOPQRSTUVWXYZ1234567890~`!@#$%^&*()_-+¤Â¤";

   int  start = 0;
   int  end = 0;
   int  diff = 0;
   _kernel_swi_regs r;

   // We'll test calling the "c_lowercase" function
   printf("Testing C functional method...\n");
   _kernel_swi(OS_ReadMonotonicTime, &r, &r);
   start = r.r[0];
   for (int loop = 0; loop < 1000; loop++)
      c_lowercase(s);
   _kernel_swi(OS_ReadMonotonicTime, &r, &r);
   end = r.r[0];
   diff = end - start;
   printf("Time taken for 1000 iterations was %d centiseconds.\n\n", diff);

   // We'll do the same thing, only without the function call.
   printf("Testing C non-functional method...\n");
   _kernel_swi(OS_ReadMonotonicTime, &r, &r);
   start = r.r[0];
   for (int loop = 0; loop < 1000; loop++)
   {
      int i = 0;

      while ( s[i] )
      {
         s[i] = tolower(s[i]);
         i++;
      }
   }
   _kernel_swi(OS_ReadMonotonicTime, &r, &r);
   end = r.r[0];
   diff = end - start;
   printf("Time taken for 1000 iterations was %d centiseconds.\n\n", diff);

   // And let's see how the assembler version compares.
   printf("Testing assembler method...\n");
   _kernel_swi(OS_ReadMonotonicTime, &r, &r);
   start = r.r[0];
   for (int loop = 0; loop < 1000; loop++)
      lowercase(s);
   _kernel_swi(OS_ReadMonotonicTime, &r, &r);
   end = r.r[0];
   diff = end - start;
   printf("Time taken for 1000 iterations was %d centiseconds.\n\n", diff);

   // And let's see how the faster assembler version compares.
   printf("Testing faster assembler method...\n");
   _kernel_swi(OS_ReadMonotonicTime, &r, &r);
   start = r.r[0];
   faster_lowercase_init();
   for (int loop = 0; loop < 1000; loop++)
      faster_lowercase(s);
   _kernel_swi(OS_ReadMonotonicTime, &r, &r);
   end = r.r[0];
   diff = end - start;
   printf("Time taken for 1000 iterations was %d centiseconds.\n\n", diff);

   return 0;
}



void c_lowercase(char string[])
{
   int  i = 0;
   while ( string[i] )
   {
      string[i] = tolower(string[i]);
      i++;
   }
   return;
}

The assembler code (s.lower):


; lowercase() in assembler...
;
;
; This code is written for objasm 3.27 or later.
;
; Other assemblers may need code modifications, or just remove
; the macro definition and references to it.


        AREA |C$$code|, CODE, READONLY



        ; This macro sets up the APCS backtrace 'name'
        MACRO
        HEAD     $name
        =        $name, 0
        ALIGN
        &        &FF000000 :OR: (:LEN: $name + 4 :AND: -4)
        MEND



; APCS registers
a1      RN      0
R0      RN      0
a2      RN      1
R1      RN      1
a3      RN      2
R2      RN      2
a4      RN      3
R3      RN      3
v1      RN      4
R4      RN      4
v2      RN      5
v3      RN      6
v4      RN      7
v5      RN      8
v6      RN      9
sl      RN      10
fp      RN      11
ip      RN      12
sp      RN      13
lr      RN      14
pc      RN      15




        EXPORT  |lowercase|

; R0 = Pointer to string (set on entry)
; R1 = Byte read from string
; R2 = Pointer to lowercase table
        HEAD     ("lowercase")
|lowercase|
        MOV      ip, sp
        STMFD    sp!, {a1, fp, ip, lr, pc}
        SUB      fp, ip, #4

        STMFD    sp!, {v6}

        MOV      R1, R0            ; Preserve string pointer
        SWI      &43040            ; "Territory_Number"
        SWI      &43057            ; "Territory_LowerCaseTable"
        MOV      R2, R0            ; Set lowercase table pointer
        MOV      R0, R1            ; Restore string pointer

lowercase_loop
        LDRB     R1, [R0]          ; Load character from R0
        CMP      R1, #0            ; Is it a null byte?
        LDMEQEA  fp, {fp, sp, pc}  ; Return if null (end of string)
        ; for APCS-R, the line above should end "pc}^"
        LDRB     R1, [R2, R1]      ; Convert to indexed lowercase character
        STRB     R1, [R0], #1      ; Store character, increment offset pointer
        B        lowercase_loop    ; Mulberry bushes


        EXPORT  |faster_lowercase_init|
|faster_lowercase_init|
        SWI      &43040            ; "Territory_Number"
        SWI      &43057            ; "Territory_LowerCaseTable"
        STR      R0, lowertable_ptr; Save the address of the table
        MOV      PC, R14
        ; for APCS-R, the line above should say "MOVS PC, R14"



        EXPORT  |faster_lowercase|
; R0 = Pointer to string (set on entry)
; R1 = Byte read from string
; R2 = Pointer to lowercase table
        HEAD     ("faster_lowercase")
|faster_lowercase|
        MOV      ip, sp
        STMFD    sp!, {a1, fp, ip, lr, pc}
        SUB      fp, ip, #4
        STMFD    sp!, {v6}

        LDR      R2, lowertable_ptr; Get lowercase table pointer
        ; this will NOT work if the offset to 'lowertable_ptr' is
        ; too far away for the LDR. In this case, use an indirect.
f_lowercase_loop
        LDRB     R1, [R0]          ; Load character from R0
        CMP      R1, #0            ; Is it a C-style terminator?
        ;CMPNE    R1, #13          ; Is it a BASIC-style terminator?
        ; enable the above line if you wish to handle BASIC-style strings
        LDMEQEA  fp, {fp, sp, pc}  ; Return if end of string
        ; for APCS-R, the line above should end "pc}^"
        LDRB     R1, [R2, R1]      ; Convert to indexed lowercase character
        STRB     R1, [R0], #1      ; Store character, increment offset pointer
        B        f_lowercase_loop  ; Mulberry bushes


        AREA |C$$data|, DATA

lowertable_ptr
        DCD     0

        END

The MakeFile:

If you are using the older compiler/assembler, you can use the older MakeFile. This one is for the new compiler.


# Project: lowertest 

.SUFFIXES:   .c .s .o
CCflags      = -c -depend !depend -I,C: -Otime -apcs 3/32 -throwback -fa
Linkflags    = -aif -o $@
ObjAsmflags  = -depend !depend -Stamp -Apcs 3/32 -Quit

c_files      = o.runit
asm_files    = o.lower
libraries    = C:o.stubs

@.lowertest: $(c_files) $(asm_files)
             link $(linkflags) $(asm_files) $(c_files) $(libraries)

.c.o:;       cc $(ccflags) $< -o $@
.s.o:;       objasm $(objasmflags) -o $@ $<

Wrapping up

You might be a bit bored if you've bothered to read this far.
That's perfectly okay.

I have done this to death to prove a point. While it is possible to make something better, there is only so much better that you can make it.
I am still a little shocked by two things:

How much poorer the SharedCLibrary code was than... well, everything else!
How little a difference it actually made to remove nearly 2000 SWI calls...

If you are a commercial programmer, then you have to write reliable code. Personally, I would rather have a slow operating system than a fast one that crashed a lot. You won't have time to experiment, and if you aren't too happy with your work situation, then the chances are that the last thing you'd care to do when you go home is...more programming!
However if you, like me, are a White Hat - somebody who thinks 'hacking' is fun (the good type, not the illegal type (you don't know how much it annoys me to always have to qualify the use of the word "hacking")), then maybe some time when you feel a little bored - cobble up some code like this. Run it. See what happens. It might surprise you!

Return to assembler index

Example 9b Rewriting lowercase() (again!)