Example 9b
|
With reference to "Don't be over-zealous", I thought I'd
try a few more optimisms.
Why?
Because.
Actually, it is quite a useful exercise as it demonstrates that while most code can be optimised, there are sometimes good reasons for not doing such a thing.
IMPORTANT! This example is designed to be compiled/assembled with
the 'new' 26/32 bit compiler.
For the record, this is:
cc v5.51 objasm v3.27 link v5.27or, obviously, a later version.
The first step was to use !LibFile to strip ansilib into its components. That done, we toss the ctype sub-library to !DecAOF. It'll output loads of garbage.
In the symbol table we'll see:
tolower : Global, Leaf, Relative, offset 0x00e0 in area "C$$code"So, let's look at the dump for +&00E0:
0x0000e0: e51f100c .... : LDR r1,0xdc 0x0000e4: e5911004 .... : LDR r1,[r1,#4] 0x0000e8: e3510000 ..Q. : CMP r1,#0 0x0000ec: 17d10000 .... : LDRNEB r0,[r1,r0] 0x0000f0: 11a0f00e .... : MOVNE pc,r14 0x0000f4: e51f10ec .... : LDR r1,0x10 0x0000f8: e7d11000 .... : LDRB r1,[r1,r0] 0x0000fc: e3110010 .... : TST r1,#0x10 0x000100: 12800020 ... : ADDNE r0,r0,#0x20 0x000104: e1a0f00e .... : MOV pc,r14There it is. In essence, it isn't much different to our routine. But, you might ask, what is it referring to at &00DC?
0x000360: e1a00004 .... : MOV r0,r4 0x000364: e3a01058 X... : MOV r1,#0x58 0x000368: e2811a43 C... : ADD r1,r1,#0x43000 0x00036c: ebffff80 .... : BL getcvttable 0x000370: e51f529c .R.. : LDR r5,0xdc 0x000374: e5850000 .... : STR r0,[r5,#0] 0x000378: e1a00004 .... : MOV r0,r4 0x00037c: e3a01057 W... : MOV r1,#0x57 0x000380: e2811a43 C... : ADD r1,r1,#0x43000 0x000384: ebffff7a z... : BL getcvttable 0x000388: e5a50004 .... : STR r0,[r5,#4]!The first SWI built is &43058 ("Territory_UpperCaseTable"), the second is &43057 ("Territory_LowerCaseTable").
Now let's isolate the c_lowercase function. The bit that does the actual work is in bold, the rest sets up the APCS stack frame and does the standard stack space check.
c_lowercase MOV ip,sp STMDB sp!,{a1,v1,v2,fp,ip,lr,pc} SUB fp,ip,#4 CMP sp,sl BLLT __rt_stkovf_split_small MOV v1,a1 MOV v2,#0 LDRB a1,[a1,#0] CMP a1,#0 LDMEQDB fp,{v1,v2,fp,sp,pc} |L0002cc.J4.c_lowercase| LDRB a1,[v1,v2] BL tolower STRB a1,[v1,v2] ADD v2,v2,#1 LDRB a1,[v1,v2] CMP a1,#0 BNE |L0002cc.J4.c_lowercase| LDMDB fp,{v1,v2,fp,sp,pc}
I say this because the APCS setup (above) is essentially external to our core - which is the
part marked in bold.
Deep inside, I might have wished for toupper() to have an 'expensive' APCS setup, as this
might explain the disparate timings.
Instead, we have:
[...our program...] |L0002cc.J4.c_lowercase| .-> LDRB a1,[v1,v2] | BL tolower ---> [...CLib jump-table...] | | | | | '----> [ ...in ansilib, probably the | same/similar in SharedCLib...] | LDR r1,0xdc | LDR r1,[r1,#4] | CMP r1,#0 | LDRNEB r0,[r1,r0] | MOVNE pc,r14 | LDR r1,0x10 | LDRB r1,[r1,r0] | TST r1,#0x10 | ADDNE r0,r0,#0x20 | MOV pc,r14 | .----------------------------------' | \|/ | [...in our program again...] | STRB a1,[v1,v2] | ADD v2,v2,#1 | LDRB a1,[v1,v2] | CMP a1,#0 '-- BNE |L0002cc.J4.c_lowercase|Before we go any further, I will reveal to you my timings. The exact differences will be discussed later. For now, we want to compare the C functional with the assembler.
IMPORTANT!
For all intents and purposes, the actual timings shown below are irrelevant. They were
performed in a TaskWindow on an 32+1Mb RiscPC with and ARM710 and the nested WIMP. The actual
results you get will vary wildly dependant upon your machine, your podules (both Econet
and Ethernet can do 'background' stuff; so too can some types of SCSI card), the software that
is running, etc, etc, etc...
What we are to concentrate on is the difference between the numbers. How much faster is
X than Y, as a percentage? That sort of thing.
So, then, here are the results of the international jury...
Testing C functional method... Time taken for 1000 iterations was 141 centiseconds. Testing C non-functional method... Time taken for 1000 iterations was 147 centiseconds. Testing assembler method... Time taken for 1000 iterations was 50 centiseconds. Testing faster assembler method... Time taken for 1000 iterations was 45 centiseconds.The assembler method runs in 50 centiseconds. The C function runs in 141 centiseconds. What could be causing this? I added twenty NOP instructions to the inner loop of the assembler function and it went up to 108 centiseconds ... so even that was quicker!
So I then tried to split the assembler routine up. It loads a byte and branches to the code to
do the tolower conversion, which then branches back. That took 65 centiseconds.
Wait! The true SharedCLibrary has another branch - we branch (with link) to the jump-table which
does an absolute branch to the real code. So I mimicked this, and the timing came to 76cs. I
think if I pushed a bunch of NOPs back in, I might get it to touch 100. So, in total, my routine
appears to run that much faster than the standard library routines, and quite frankly I
am at a loss to explain why the library is that much slower.
Now, you must expect about 2cs inaccuracy since I'm using the TaskWindow. However I repeat my tests three times and then I average the results. So, for the record, a test when linked with the stand-alone ansilib gives us:
Testing C functional method... Time taken for 1000 iterations was 109 centiseconds. Testing C non-functional method... Time taken for 1000 iterations was 123 centiseconds. Testing assembler method... Time taken for 1000 iterations was 49 centiseconds. Testing faster assembler method... Time taken for 1000 iterations was 45 centiseconds.Suddenly our code seems that much less impressive. In fact if our function takes 50cs normally, 108cs with twenty NOPs, and 15cs with a branch; then the time taken for NOPs and branch should be ( (108 - 50) + (65 - 50) + 50 ) which comes to 123cs. Given that 20 NOPs is a rather inaccurate method of adding extra time to the function, I think it is safe to draw a rough assumption that our code and the ansilib code would be pretty evenly matched - you know, if I did it that way... :-)
|L000064.J4.main| ADD a1,sp,#&28 BL c_lowercase ADD v1,v1,#1 CMP v1,#&3e8 BLT |L000064.J4.main|This calls the c_lowercase() function (described above).
Here's the in-line version:
|L0000b8.J6.main| MOV v1,#0 LDRB a1,[sp,#&28] CMP a1,#0 BEQ |L0000f0.J8.main| |L0000c8.J7.main| ADD a1,sp,#&28 LDRB a1,[a1,v1] BL tolower ADD a2,sp,#&28 STRB a1,[a2,v1] ADD v1,v1,#1 ADD a1,sp,#&28 LDRB a1,[a1,v1] CMP a1,#0 BNE |L0000c8.J7.main| |L0000f0.J8.main| ADD v2,v2,#1 CMP v2,#&3e8 BLT |L0000b8.J6.main|The difference? The &28 offsets from sp. These, in total, no doubt execute faster than the APCS stack frame setup in the function version; so surely the function should be a tad slower, if anything? The rest looks pretty similar, so...
In this case, an attempt to speed things up has failed. Miserably. Go figure!
PS: Changing the int i = 0;
to register int i = 0;
does not affect the
timings.
/* c.runit Example of lowercase conversion routines in C and assembler. by Rick Murray, 2003/12/23 For my Assembler tutorial at http://www.heyrick.co.uk/assembler/ THIS CODE IS INTENDED FOR THE 26/32 COMPILER. IT USES A FEW ISO 9899:1990 THINGS, SO WILL NOT WORK ON EARLIER COMPILERS. THIS CODE (AND THE ASSEMBLER PARTS) *SHOULD* BE 32 BIT CLEAN. */ #include <stdio.h> #include <ctype.h> #include "kernel.h" #include "swis.h" void c_lowercase(char []); extern void lowercase(char *); extern void faster_lowercase_init(void); extern void faster_lowercase(char *); int main(void) { // note that this is not strictly identical to the earlier code - each '¤' // character seems to be prefixed with 'Â'. Is this Fresco being weird? // I've left it in (basically because I only noticed it when marking up // this program for the HTML!) char s[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890~`!@#$%^&*()_-+={[}]|\\:" \ ";\"\'<,>.?/ABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890~`!@#$%^&*()_" \ "-+={[}]|\\:;\"\'<,>.?/ABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890~`" \ "!@#$%^&*()_-+={[}]|\\:;\"\'<,>.?/ABCDEFGHIJKLMNOPQRSTUVWXYZ1" \ "234567890~`!@#$%^&*()_-+¤ABCDEFGHIJKLMNOPQRSTUVWXYZ12345678" \ "90~`!@#$%^&*()_-+={[}]|\\:;\"\'<,>.?/ABCDEFGHIJKLMNOPQRSTUVW" \ "XYZ1234567890~`!@#$%^&*()_-+={[}]|\\:;\"\'<,>.?/ABCDEFGHIJKL" \ "MNOPQRSTUVWXYZ1234567890~`!@#$%^&*()_-+={[}]|\\:;\"\'<,>.?/A" \ "BCDEFGHIJKLMNOPQRSTUVWXYZ1234567890~`!@#$%^&*()_-+¤ABCDEFGH" \ "IJKLMNOPQRSTUVWXYZ1234567890~`!@#$%^&*()_-+={[}]|\\:;\"\'<,>" \ ".?/ABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890~`!@#$%^&*()_-+={[}]|" \ "\\:;\"\'<,>.?/ABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890~`!@#$%^&*" \ "()_-+={[}]|\\:;\"\'<,>.?/ABCDEFGHIJKLMNOPQRSTUVWXYZ123456789" \ "0~`!@#$%^&*()_-+¤ABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890~`!@#$" \ "%^&*()_-+={[}]|\\:;\"\'<,>.?/ABCDEFGHIJKLMNOPQRSTUVWXYZ12345" \ "67890~`!@#$%^&*()_-+={[}]|\\:;\"\'<,>.?/ABCDEFGHIJKLMNOPQRST" \ "UVWXYZ1234567890~`!@#$%^&*()_-+={[}]|\\:;\"\'<,>.?/ABCDEFGHI" \ "JKLMNOPQRSTUVWXYZ1234567890~`!@#$%^&*()_-+¤Â¤"; int start = 0; int end = 0; int diff = 0; _kernel_swi_regs r; // We'll test calling the "c_lowercase" function printf("Testing C functional method...\n"); _kernel_swi(OS_ReadMonotonicTime, &r, &r); start = r.r[0]; for (int loop = 0; loop < 1000; loop++) c_lowercase(s); _kernel_swi(OS_ReadMonotonicTime, &r, &r); end = r.r[0]; diff = end - start; printf("Time taken for 1000 iterations was %d centiseconds.\n\n", diff); // We'll do the same thing, only without the function call. printf("Testing C non-functional method...\n"); _kernel_swi(OS_ReadMonotonicTime, &r, &r); start = r.r[0]; for (int loop = 0; loop < 1000; loop++) { int i = 0; while ( s[i] ) { s[i] = tolower(s[i]); i++; } } _kernel_swi(OS_ReadMonotonicTime, &r, &r); end = r.r[0]; diff = end - start; printf("Time taken for 1000 iterations was %d centiseconds.\n\n", diff); // And let's see how the assembler version compares. printf("Testing assembler method...\n"); _kernel_swi(OS_ReadMonotonicTime, &r, &r); start = r.r[0]; for (int loop = 0; loop < 1000; loop++) lowercase(s); _kernel_swi(OS_ReadMonotonicTime, &r, &r); end = r.r[0]; diff = end - start; printf("Time taken for 1000 iterations was %d centiseconds.\n\n", diff); // And let's see how the faster assembler version compares. printf("Testing faster assembler method...\n"); _kernel_swi(OS_ReadMonotonicTime, &r, &r); start = r.r[0]; faster_lowercase_init(); for (int loop = 0; loop < 1000; loop++) faster_lowercase(s); _kernel_swi(OS_ReadMonotonicTime, &r, &r); end = r.r[0]; diff = end - start; printf("Time taken for 1000 iterations was %d centiseconds.\n\n", diff); return 0; } void c_lowercase(char string[]) { int i = 0; while ( string[i] ) { string[i] = tolower(string[i]); i++; } return; }
; lowercase() in assembler... ; ; ; This code is written for objasm 3.27 or later. ; ; Other assemblers may need code modifications, or just remove ; the macro definition and references to it. AREA |C$$code|, CODE, READONLY ; This macro sets up the APCS backtrace 'name' MACRO HEAD $name = $name, 0 ALIGN & &FF000000 :OR: (:LEN: $name + 4 :AND: -4) MEND ; APCS registers a1 RN 0 R0 RN 0 a2 RN 1 R1 RN 1 a3 RN 2 R2 RN 2 a4 RN 3 R3 RN 3 v1 RN 4 R4 RN 4 v2 RN 5 v3 RN 6 v4 RN 7 v5 RN 8 v6 RN 9 sl RN 10 fp RN 11 ip RN 12 sp RN 13 lr RN 14 pc RN 15 EXPORT |lowercase| ; R0 = Pointer to string (set on entry) ; R1 = Byte read from string ; R2 = Pointer to lowercase table HEAD ("lowercase") |lowercase| MOV ip, sp STMFD sp!, {a1, fp, ip, lr, pc} SUB fp, ip, #4 STMFD sp!, {v6} MOV R1, R0 ; Preserve string pointer SWI &43040 ; "Territory_Number" SWI &43057 ; "Territory_LowerCaseTable" MOV R2, R0 ; Set lowercase table pointer MOV R0, R1 ; Restore string pointer lowercase_loop LDRB R1, [R0] ; Load character from R0 CMP R1, #0 ; Is it a null byte? LDMEQEA fp, {fp, sp, pc} ; Return if null (end of string) ; for APCS-R, the line above should end "pc}^" LDRB R1, [R2, R1] ; Convert to indexed lowercase character STRB R1, [R0], #1 ; Store character, increment offset pointer B lowercase_loop ; Mulberry bushes EXPORT |faster_lowercase_init| |faster_lowercase_init| SWI &43040 ; "Territory_Number" SWI &43057 ; "Territory_LowerCaseTable" STR R0, lowertable_ptr; Save the address of the table MOV PC, R14 ; for APCS-R, the line above should say "MOVS PC, R14" EXPORT |faster_lowercase| ; R0 = Pointer to string (set on entry) ; R1 = Byte read from string ; R2 = Pointer to lowercase table HEAD ("faster_lowercase") |faster_lowercase| MOV ip, sp STMFD sp!, {a1, fp, ip, lr, pc} SUB fp, ip, #4 STMFD sp!, {v6} LDR R2, lowertable_ptr; Get lowercase table pointer ; this will NOT work if the offset to 'lowertable_ptr' is ; too far away for the LDR. In this case, use an indirect. f_lowercase_loop LDRB R1, [R0] ; Load character from R0 CMP R1, #0 ; Is it a C-style terminator? ;CMPNE R1, #13 ; Is it a BASIC-style terminator? ; enable the above line if you wish to handle BASIC-style strings LDMEQEA fp, {fp, sp, pc} ; Return if end of string ; for APCS-R, the line above should end "pc}^" LDRB R1, [R2, R1] ; Convert to indexed lowercase character STRB R1, [R0], #1 ; Store character, increment offset pointer B f_lowercase_loop ; Mulberry bushes AREA |C$$data|, DATA lowertable_ptr DCD 0 END
# Project: lowertest .SUFFIXES: .c .s .o CCflags = -c -depend !depend -I,C: -Otime -apcs 3/32 -throwback -fa Linkflags = -aif -o $@ ObjAsmflags = -depend !depend -Stamp -Apcs 3/32 -Quit c_files = o.runit asm_files = o.lower libraries = C:o.stubs @.lowertest: $(c_files) $(asm_files) link $(linkflags) $(asm_files) $(c_files) $(libraries) .c.o:; cc $(ccflags) $< -o $@ .s.o:; objasm $(objasmflags) -o $@ $<
I have done this to death to prove a point. While it is possible to make something better, there
is only so much better that you can make it.
I am still a little shocked by two things:
If you are a commercial programmer, then you have to write reliable code. Personally, I would
rather have a slow operating system than a fast one that crashed a lot. You won't have
time to experiment, and if you aren't too happy with your work situation, then the chances are
that the last thing you'd care to do when you go home is...more programming!
However if you, like me, are a White Hat - somebody who thinks 'hacking' is fun (the good type,
not the illegal type (you don't know how much it annoys me to always have to qualify the use of
the word "hacking")), then maybe some time when you feel a little bored - cobble up
some code like this. Run it. See what happens. It might surprise you!