Rick's b.log - 2011/08/07
mailto:
blog -at- heyrick -dot- eu
This is a difficult thing to write, for the only sane way to think of a processor is "as an x86" or "as an ARM". Trying to correlate one with the other will always be sub-optimal; like a BASIC coder writing C as if it were BASIC...
However, of late I have been asked a few times for a comparison chart between the two, so here goes. And note, I'm not familiar with the x86, so I'm just going to ignore instructions I don't know...
With the exception of R15 (PC) and R14 (LR), there is no specific use for any of the other registers. Tradition dictates that R13 is the stack pointer and, as far as I know, most systems keep to this; however, it is not set in stone.
This is wrapper code around a little function that is useful for "cooking" data for display, for times when throwing control codes at the terminal may result in all manner of weirdness.
In the following examples, I shall ignore code involved in setting up the C runtime, etc, and will concentrate purely upon the code that relates to the above bit of code.
Essentially, the code character comparison works as follows:
Note also that we BL to write_char (writes return address in R14) but we B (no return address) to the _printf function. This is a nifty piece of optimisation meaning that the return from _printf will pick up on the return address in R14, and thus go right back to the main loop.
ARM vs x86
Basic instruction set comparisons
x86       ARM
ADC       ADC
ADD       ADD
AND       AND
BT        TST
CALL      BL
CLI       Modify CPSR with AND/ORR/EOR...
CMOVcc    MOV with a condition code
CMP       CMP
CPUID     Access CP15 for CPU info; there's no CPUID
DEC       SUB Rx,Rx,#1
HLT       the point being...?
IMUL      SMULL
IN        LDR[B] from address
INC       ADD Rx,Rx,#1
INT       SWI
INVD      Use CP15 for cache control
INVLPG    Use CP15 for cache control
IRET      Push relevant R14 into PC
JA, JAE, JB, JBE, JC, JE, JG, JGE, JL, JLE, JMP, JNA, JNAE, JNB,
JNBE, JNC, JNE, JNG, JNGE, JNL, JNLE, JNO, JNS, JNZ, JO, JP, JS, JZ
          B with appropriate condition code
LOADALL   LDMFD ?
LODS      LDR[B] ?
MOV       MOV, LDR, LDM, etc
MUL       MUL, MLA, or UMULL
NEG       Done logically, RSB etc
NOP       MOV Rx,Rx (copy a register to itself)
NOT       MVN
OR        ORR
OUT       STR[B] to address
PUSH      STMFD to stack
POP       LDMFD from stack
POR       ORR
PXOR      EOR
SETALC    Modify CPSR
SUB       SUB
SYSCALL   SWI ?
WBINVD    Use CP15 for cache control
XOR       EOR
Caveats
One of the problems with the ARM is it is a load/store processor. Remarkably few instructions communicate externally - mainly LDM, LDR[B], STM, STR[B], and SWP. All others work with registers. The idea being that relevant data is loaded (and you have over ten registers plus a cache in your favour) and heavy processing can then take place within the processor core, only stalling for memory access when necessary.
Conditional execution rocks!
The power of conditional execution should not be underestimated. Consider the following code:
#include <stdio.h>

void write_char(int);

int main(void)
{
   int loop = 0;

   for (loop = 0; loop <= 256; loop++)
      write_char(loop);

   return 0;
}

void write_char(int mychr)
{
   if (mychr < 32)
      mychr = '.';
   else
      if (mychr >= 127)
         mychr = '.';

   printf("%c", mychr);
   return;
}
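The test inside write_char boils down to a simple range check. As a minimal sketch, the same logic can be written as a pure function, which makes the boundaries at 32 and 127 easy to check. The name cook_char is made up for this illustration and is not part of the original program.

```c
/* Hypothetical helper (not in the original program): returns the character
   that write_char would hand to printf for a given code. */
static int cook_char(int mychr)
{
    /* Control codes (below 32) and DEL or anything above it (127 and up)
       are replaced by a dot; everything else passes through unchanged. */
    if (mychr < 32 || mychr >= 127)
        return '.';
    return mychr;
}
```

Calling cook_char(7) gives '.', while cook_char('A') gives 'A' back untouched.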
Example - TurboC++ v1.0
First up, in all its oh-my-God glory, is Borland TurboC++ 1.0's code, optimised for speed and a 286 class processor.
; int main(void)
; {
; int loop = 0;
        mov word ptr [bp-2],0
;
; for (loop = 0; loop <= 256; loop++)
        mov word ptr [bp-2],0
        jmp short @1@98
@1@50:
;
; write_char(loop);
        push word ptr [bp-2]
        call near ptr _write_char
        pop cx
        inc word ptr [bp-2]
@1@98:
        cmp word ptr [bp-2],256
        jle short @1@50
;
; return 0;
        xor ax,ax
; }
        leave
        ret
_main endp
;
; void write_char(int mychr)
        assume cs:_TEXT
_write_char proc near
        push bp
        mov bp,sp
; if (mychr < 32)
        cmp word ptr [bp+4],32
        jge short @2@74
        jmp short @2@98
@2@74:
; mychr = '.';
; else
; if (mychr >= 127)
        cmp word ptr [bp+4],127
        jl short @2@122
@2@98:
; mychr = '.';
        mov word ptr [bp+4],46
@2@122:
;
; printf("%c", mychr);
        push word ptr [bp+4]
        push offset DGROUP:s@
        call near ptr _printf
        pop cx
        pop cx
; return;
; }
        pop bp
        ret
Due to how the x86 works, something so simple requires a fair amount of branching. It is no wonder the x86 has advanced branch prediction.
compare char with 32
is it less?
if no go to '2-74', otherwise go to '2-98'.
2-74:
compare char with 127
if less than, go to '2-122'.
= falls through =
2-98:
set char to 46 (ASCII code for '.')
= falls through =
2-122:
print the character
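The walkthrough above can be sketched in C using goto, so that each jump in the listing maps onto one line. This is only a model, not compiler output: turboc_cook and the taken counter are names invented for the sketch, and it assumes that counting the jumps actually taken is a fair way to compare the code paths.

```c
/* Model of the Turbo C++ flow for write_char. Returns the value that would
   be passed to printf; *taken receives the number of jumps actually taken. */
static int turboc_cook(int mychr, int *taken)
{
    *taken = 0;
    if (mychr >= 32) { (*taken)++; goto l_74; }    /* jge short @2@74  */
    (*taken)++; goto l_98;                         /* jmp short @2@98  */
l_74:
    if (mychr < 127) { (*taken)++; goto l_122; }   /* jl short @2@122  */
    /* = falls through = */
l_98:
    mychr = '.';                                   /* mov word ptr [bp+4],46 */
    /* = falls through = */
l_122:
    return mychr;                                  /* the printf call follows here */
}
```

Feeding it a control code, a letter, and a high value shows which route each one takes through the labels.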
That code is somewhat displeasing, but it is logical and constrained by what the x86 can and cannot do. Note that a printable character takes two jumps on its way to printf (the jge to '2-74' and the jl to '2-122'); only a non-printable character gets through with one.
Example - OpenWatcom v1.2
I will show you only the crux of the write_char function. OpenWatcom sets the '.' twice to make the code a little more efficient, at the expense of size. But then again, it seems as if OpenWatcom's default setting is to automatically save (and later restore) every register in function calls... Code is for Pentium class processors, with all possible optimisations (hehe, -otexan).
; {
; if (mychr < 32)
        cmp dword ptr -8[ebp],20H
        jge L$3
; {
; mychr = '.';
        mov dword ptr -8[ebp],2eH
; }
; else
        jmp L$4
; {
; if (mychr >= 127)
L$3:
        cmp dword ptr -8[ebp],7fH
        jl L$4
; {
; mychr = '.';
        mov dword ptr -8[ebp],2eH
; }
; }
;
; printf("%c", mychr);
L$4:
        mov eax,dword ptr -8[ebp]
        push eax
        mov eax,offset FLAT:L$5
        push eax
        call near ptr FLAT:printf_
        add esp,8
The flow here is slightly different, perhaps tidier. Unfortunately a printable character requires two branches in order to pass through the code.
compare byte with 32
if it is greater or equal, go to "L-3"
set byte to '.'
go to "L-4"
L-3:
compare byte with 127
if it is less than, go to "L-4"
set byte to '.'
= falls through =
L-4:
print the character
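As with any such walkthrough, the flow can be modelled in C with goto, one line per jump. This is a sketch rather than anything OpenWatcom produced: watcom_cook and the taken counter are hypothetical names, and only jumps that are actually taken are counted.

```c
/* Model of the OpenWatcom flow for write_char. Returns the value that would
   be passed to printf; *taken receives the number of jumps actually taken. */
static int watcom_cook(int mychr, int *taken)
{
    *taken = 0;
    if (mychr >= 0x20) { (*taken)++; goto l_3; }   /* jge L$3 */
    mychr = '.';                                   /* mov dword ptr -8[ebp],2eH */
    (*taken)++; goto l_4;                          /* jmp L$4 */
l_3:
    if (mychr < 0x7f) { (*taken)++; goto l_4; }    /* jl L$4  */
    mychr = '.';                                   /* mov dword ptr -8[ebp],2eH */
    /* = falls through = */
l_4:
    return mychr;                                  /* the printf_ call follows here */
}
```

The two separate assignments of '.' in the model correspond to the two mov instructions in the listing.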
ARM code - C compiler suite v5
Now let's see the same thing as ARM code. This was created with cc of Acorn/Norcroft heritage, in the early 32 bit Castle incarnation. Optimisation favours speed over size.
Because this compiler does not interleave source with code, I shall present it with my own comments.
        MOV     v1,#0                   ; initialise loop counter
|L000024.J4.main|
        MOV     a1,v1
        BL      write_char              ; call "write_char"
        ADD     v1,v1,#1                ; loop = loop + 1
        CMP     v1,#&100                ; compare loop with 256
        BLE     |L000024.J4.main|       ; if less or equal, repeat
        MOV     a1,#0
        LDMDB   fp,{v1,fp,sp,pc}        ; exit :-)

        IMPORT  _printf
write_char
        CMP     a1,#&20                 ; compare byte with 32
        MOVLT   a1,#&2e                 ; if less than, set to '.'
        BLT     |L000054.J6.write_char| ; if less than, branch to print
        CMP     a1,#&7f                 ; compare byte with 127
        MOVGE   a1,#&2e                 ; if greater or equal, set to '.'
|L000054.J6.write_char|
        MOV     a2,a1                   ; call CLib "_printf" function
        ADD     a1,pc,#L000060-.-8
        B       _printf
I won't present a walkthrough as the comments provide much the same thing. However it is worth noting that, thanks to conditional execution, the character test in write_char contains only one branch (the BLT; the closing B is the jump into _printf itself), and that a printable character passes through the test with no branch taken whatsoever.
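To make the conditional execution concrete, here is a rough C model of the write_char listing. It is a sketch under simplifying assumptions: the processor flags are reduced to two booleans set by the compare, arm_cmp and arm_cook are invented names, and the final B into _printf is left out of the count because it is the call itself.

```c
/* Simplified stand-in for the flags left behind by CMP. */
struct arm_flags { int lt; int ge; };

static struct arm_flags arm_cmp(int a, int b)
{
    struct arm_flags f;
    f.lt = (a < b);
    f.ge = (a >= b);
    return f;
}

/* Model of write_char. Each conditional instruction either executes or is
   skipped according to the flags; only BLT can change the flow.
   *taken receives the number of branches actually taken. */
static int arm_cook(int a1, int *taken)
{
    struct arm_flags f;
    *taken = 0;
    f = arm_cmp(a1, 0x20);                   /* CMP   a1,#&20 */
    if (f.lt) a1 = 0x2e;                     /* MOVLT a1,#&2e (flags unchanged) */
    if (f.lt) { (*taken)++; goto print; }    /* BLT   print   */
    f = arm_cmp(a1, 0x7f);                   /* CMP   a1,#&7f */
    if (f.ge) a1 = 0x2e;                     /* MOVGE a1,#&2e */
print:
    return a1;                               /* MOV a2,a1 ... B _printf */
}
```

Note that MOVLT does not alter the flags, which is why the BLT that follows can reuse the result of the same CMP.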
Note also that the LDMDB (the stack-style name for it is LDMEA, as the "DB" suffix means Decrement Before) restores all of the important registers from just below the frame pointer: v1 that we corrupted, the previous frame pointer, the previous stack pointer, and the return address loaded directly into the program counter. All in one instruction. No chains of pop, pop, pop, pop...
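For those who have not met load-multiple before, the following C sketch simulates what LDMDB fp,{v1,fp,sp,pc} does. It is a simplified model, not an emulator: memory is treated as an array of words, there is no base writeback, and the function name ldmdb is made up for the illustration. Under the APCS register naming, v1 is R4, fp is R11, sp is R13 and pc is R15.

```c
#include <assert.h>
#include <stdint.h>

/* Simulate LDMDB base,{reglist} without writeback.
   mem is an array of 32-bit words and base_index is the word index held in
   the base register. Bit n of reglist selects register n. The block of
   words sits immediately below the base, and the lowest-numbered register
   is always loaded from the lowest address. */
static void ldmdb(uint32_t regs[16], const uint32_t *mem,
                  unsigned base_index, uint16_t reglist)
{
    unsigned count = 0, n, addr;

    for (n = 0; n < 16; n++)
        if (reglist & (1u << n))
            count++;

    assert(base_index >= count);     /* the block must not run off the array */
    addr = base_index - count;       /* "decrement before" */

    for (n = 0; n < 16; n++)
        if (reglist & (1u << n))
            regs[n] = mem[addr++];
}
```

With four registers in the list, the four words just below the base end up in R4, R11, R13 and R15 in a single call, which is the software equivalent of that one instruction.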
opal, 16th September 2011, 15:22
Hello,
My guess for your little problem is that the range of characters is one char larger than needed. It should have been:
for (loop = 0; loop <= 255; loop++)
Regards
© 2011 Rick Murray
This web page is licenced for your personal, private, non-commercial use only. No automated processing by advertising systems is permitted. RIPA notice: No consent is given for interception of page transmission.