mailto: blog -at- heyrick -dot- eu


ARM vs x86

Intel 486 vs ARM 2

This is a difficult thing to write, for the only sane way to think of a processor is "as an x86" or "as an ARM". Trying to correlate one with the other will always be sub-optimal; like a BASIC coder writing C as if it were BASIC...

However, of late I have been asked a few times for a comparison chart between the two, so here goes. And note, I'm not familiar with the x86, so I'm just going to ignore instructions I don't know...

Basic instruction set comparisons

  x86       ARM
  ADC       ADC
  ADD       ADD
  AND       AND
  BT        TST with a single-bit mask (result lands in Z, not in carry)
  CALL      BL
  CLI       Modify CPSR with AND/ORR/EOR...
  CMOVcc    MOV with a condition code
  CMP       CMP
  CPUID     Access CP15 for CPU info; there's no CPUID
  DEC       SUB Rx,Rx,#1
  HLT       the point being...?
  IMUL      SMULL
  IN        LDR[B] from address
  INC       ADD Rx,Rx,#1
  INT       SWI
  INVD      Use CP15 for cache control
  INVLPG    Use CP15 for TLB control
  IRET      Push relevant R14 into PC
  Jcc       B with appropriate condition code. This covers JA, JAE, JB, JBE,
            JC, JE, JG, JGE, JL, JLE, JNA, JNAE, JNB, JNBE, JNC, JNE, JNG,
            JNGE, JNL, JNLE, JNO, JNS, JNZ, JO, JS, and JZ. The ARM has no
            parity flag, so JP has no direct equivalent.
  JMP       B
  LOADALL   LDMFD ?
  LODS      LDR[B] ?
  MOV       MOV, LDR, LDM, etc
  MUL       MUL, MLA, or UMULL
  NEG       RSB Rx,Rx,#0 (reverse subtract from zero)
  NOP       MOV Rx,Rx (copy a register to itself)
  NOT       MVN
  OR        ORR
  OUT       STR[B] to address
  PUSH      STMFD to stack
  POP       LDMFD from stack
  POR       ORR
  PXOR      EOR
  SETALC    MOVCS Rx,#&FF then MOVCC Rx,#0 (set a register from the carry flag)
  SUB       SUB
  SYSCALL   SWI ?
  WBINVD    Use CP15 for cache control
  XOR       EOR

Caveats

One of the problems with the ARM, when making a comparison like this, is that it is a load/store processor, so x86 instructions that operate directly on memory have no one-to-one equivalent. Remarkably few instructions access memory - mainly LDM, LDR[B], STM, STR[B], and SWP. All others work with registers. The idea is that the relevant data is loaded (and you have over ten registers plus a cache in your favour) so that heavy processing can take place within the processor core, stalling for memory access only when necessary.
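As a rough C-level illustration of the load/store idea (a sketch only; sum_bytes is a name made up for this example and has nothing to do with the compilers discussed here), the loop below reads each byte from memory exactly once and keeps the running total in a local variable, which a compiler would normally hold in a register:

```c
#include <stddef.h>

/* Hypothetical helper, for illustration only.
 * Each element is loaded from memory once (think LDRB); the
 * running total lives in a local variable, which a compiler
 * would normally keep in a register for the whole loop. */
unsigned int sum_bytes(const unsigned char *data, size_t len)
{
   unsigned int total = 0;     /* expected to stay in a register */
   size_t i;

   for (i = 0; i < len; i++)
      total += data[i];        /* one load, then a register-only add */

   return total;
}
```

On an x86 the add could take its operand straight from memory; on the ARM the load and the add are always separate instructions.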

With the exception of PC and R14 (LR), there is no specific use for any of the other registers. Tradition dictates that R13 is the stack pointer, and as far as I know, most systems keep to this, however it is not written in stone.

Conditional execution rocks!

The power of conditional execution should not be underestimated. Consider the following code:
  #include <stdio.h>

  void write_char(int);

  int  main(void)
  {
     int  loop = 0;

     for (loop = 0; loop <= 256; loop++)
        write_char(loop);

     return 0;
  }


  void write_char(int mychr)
  {
     if (mychr < 32)
        mychr = '.';
     else
        if (mychr >= 127)
           mychr = '.';

     printf("%c", mychr);

     return;
  }

This is wrapper code around a little function that is useful for "cooking" data for display, for times when throwing raw control codes at the terminal may result in all manner of weirdness.

You get a Brownie Point if you can find the intentional bug. (^_^)

In the following examples, I shall ignore the code involved in setting up the C runtime, etc, and concentrate purely on the code generated for the listing above.

Example - TurboC++ v1.0

First up, in all its oh-my-God glory, is Borland TurboC++ 1.0's code, optimised for speed and a 286 class processor.
     ; int  main(void)
     ; {
     ;    int  loop = 0;
   mov word ptr [bp-2],0
     ; 
     ;    for (loop = 0; loop <= 256; loop++)
   mov word ptr [bp-2],0
   jmp short @1@98
  @1@50:
     ; 
     ;       write_char(loop);
   push word ptr [bp-2]
   call near ptr _write_char
   pop cx
   inc word ptr [bp-2]
  @1@98:
   cmp word ptr [bp-2],256
   jle short @1@50
     ; 
     ;    return 0;
   xor ax,ax
     ; }
   leave 
   ret 
  _main endp
     ; 
     ; void write_char(int mychr)
   assume cs:_TEXT
  _write_char proc near
   push bp
   mov bp,sp
     ;    if (mychr < 32)
   cmp word ptr [bp+4],32
   jge short @2@74
   jmp short @2@98
  @2@74:
     ;       mychr = '.';
     ;    else
     ;       if (mychr >= 127)
   cmp word ptr [bp+4],127
   jl short @2@122
  @2@98:
     ;          mychr = '.';
   mov word ptr [bp+4],46
  @2@122:
     ; 
     ;    printf("%c", mychr);
   push word ptr [bp+4]
   push offset DGROUP:s@
   call near ptr _printf
   pop cx
   pop cx
     ;    return;
     ; }
   pop bp
   ret 
Due to how the x86 works, something so simple requires a fair amount of branching. It is no wonder the x86 has advanced branch prediction.

Essentially, the character comparison works as follows:

    compare char with 32
      is it less?
        if no go to '2-74', otherwise go to '2-98'.

  2-74:
    compare char with 127
    if less than, go to '2-122'.

    = falls through =

  2-98:
    set char to 46 (ASCII code for '.')

    = falls through =

  2-122:
    print the character
That code is somewhat displeasing, but it is logical and constrained by what the x86 can and cannot do. Note that a printable character, the common case, has to take two branches (the jge, then the jl) before it can be output.

Example - OpenWatcom v1.2

I will show you only the crux of the write_char function. OpenWatcom sets the '.' twice to make the code a little more efficient, at the expense of size. But then again, it seems as if OpenWatcom's default setting is to automatically save (and later restore) every register in function calls... Code is for Pentium class processors, with all possible optimisations (hehe, -otexan).
  ; {
  ;    if (mychr < 32)
      cmp         dword ptr -8[ebp],20H 
      jge         L$3 

  ;    {
  ;       mychr = '.';
      mov         dword ptr -8[ebp],2eH 

  ;    }
  ;    else
      jmp         L$4 

  ;    {
  ;       if (mychr >= 127)
  L$3:
      cmp         dword ptr -8[ebp],7fH 
      jl          L$4 

  ;       {
  ;          mychr = '.';
      mov         dword ptr -8[ebp],2eH 

  ;       }
  ;    }
  ; 
  ;    printf("%c", mychr);
  L$4:
      mov         eax,dword ptr -8[ebp] 
      push        eax 
      mov         eax,offset FLAT:L$5 
      push        eax 
      call        near ptr FLAT:printf_ 
      add         esp,8 
The flow here is slightly different, perhaps tidier. Unfortunately, a printable character has to take two branches (the jge, then the jl) in order to pass through the code.
    compare byte with 32
    if it is greater or equal, go to "L-3"

    set byte to '.'
    go to "L-4"

  L-3:
    compare byte with 127
    if it is less than, go to "L-4"

    set byte to '.'

    = falls through =

  L-4:
    print the character
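
To check the branch count, the flow above can be modelled in C with gotos standing in for the jumps. This is only a sketch: cook_watcom and the taken counter are names invented for this example, not anything the compiler produces.

```c
/* Model of the OpenWatcom control flow shown above (names made up).
 * *taken is incremented each time a jump would actually be taken.
 * The return value is the character that would be printed. */
int cook_watcom(int c, int *taken)
{
   *taken = 0;

   if (c >= 32) { (*taken)++; goto L3; }   /* jge L$3 */
   c = '.';
   (*taken)++; goto L4;                    /* jmp L$4 */
L3:
   if (c < 127) { (*taken)++; goto L4; }   /* jl  L$4 */
   c = '.';
L4:
   return c;                               /* handed on to printf */
}
```

With this model, cook_watcom('A', &n) returns 'A' with n set to 2, whereas a control code such as 10 comes back as '.' with n set to 1.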

ARM code - C compiler suite v5

Now let's see the same thing as ARM code. This was created with cc of Acorn/Norcroft heritage, in the early 32 bit Castle incarnation. Optimisation favours speed over size. Because this compiler does not interleave source with code, I shall present it with my own comments.
          MOV      v1,#0                   ; initialise loop counter
  |L000024.J4.main|
          MOV      a1,v1
          BL       write_char              ; call "write_char"
          ADD      v1,v1,#1                ; loop = loop + 1
          CMP      v1,#&100                ; compare loop with 256
          BLE      |L000024.J4.main|       ; if less or equal, repeat
          MOV      a1,#0
          LDMDB    fp,{v1,fp,sp,pc}        ; exit :-)

          IMPORT  _printf
  write_char
          CMP      a1,#&20                 ; compare byte with 32
          MOVLT    a1,#&2e                 ; if less than, set to '.'
          BLT      |L000054.J6.write_char| ; if less than, branch to print
          CMP      a1,#&7f                 ; compare byte with 127
          MOVGE    a1,#&2e                 ; if greater or equal, set to '.'
  |L000054.J6.write_char|
          MOV      a2,a1                   ; call CLib "_printf" function
          ADD      a1,pc,#L000060-.-8
          B        _printf
I won't present a walkthrough as the comments provide much the same thing. However, it is worth noting that, thanks to conditional execution, there is only one conditional branch (the BLT) in the entire write_char function, and that a printable character passes through both comparisons without taking a branch at all. The only other branch is the final B to _printf.
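
The same kind of C model can be applied to the ARM listing. Again this is a sketch with invented names (cook_arm, taken); conditionally executed instructions such as MOVLT and MOVGE are not jumps, so only a taken BLT is counted, and the final B _printf that every path takes is left out.

```c
/* Model of the ARM write_char flow shown above (names made up).
 * *taken counts taken branches, ignoring the final B _printf.
 * The return value is the character that would be printed. */
int cook_arm(int c, int *taken)
{
   int lt;

   *taken = 0;

   lt = (c < 32);                        /* CMP   a1,#&20 sets the flags   */
   if (lt) c = '.';                      /* MOVLT a1,#&2e (flags untouched) */
   if (lt) { (*taken)++; goto print; }   /* BLT   to the print code        */
   if (c >= 127) c = '.';                /* CMP a1,#&7f ; MOVGE a1,#&2e    */
print:
   return c;
}
```

Here cook_arm('A', &n) returns 'A' with n left at 0; only values below 32 take the BLT.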

Note also that we BL to write_char (which writes the return address into R14) but we B (no return address) to the _printf function. This is a nifty piece of optimisation: the return from _printf picks up the return address still held in R14, and thus goes right back to the main loop.

Note also that the LDMDB in main (LDMEA in stack notation; DB is its "proper" suffix) restores all of the important registers relative to the frame pointer: v1 that we corrupted, the previous frame pointer address, the previous stack address, and the return address, loaded directly into the program counter. All in one instruction. No chains of pop, pop, pop, pop...
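
The B _printf trick is what is usually called a tail call. As a rough C-level sketch (put_cooked is a name made up for this example, and putchar stands in for printf to keep it short), the function below ends with a call whose result is returned unchanged. That is the shape that allows a compiler to replace call-and-return with a single jump; whether a given compiler actually does so depends on the compiler and its optimisation settings.

```c
#include <stdio.h>

/* Hypothetical wrapper in the style of write_char.
 * The call to putchar is the last thing the function does and
 * its result is passed straight back, so it is in tail position:
 * a compiler is free (but not obliged) to emit a plain jump to
 * putchar and let putchar return directly to our caller. */
int put_cooked(int c)
{
   if (c < 32 || c >= 127)
      c = '.';

   return putchar(c);
}
```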

 

Your comments:

opal, 16th September 2011, 15:22
Hello, 
My guess for your little problem is that the range of characters is one char larger than needed. It should have been: 
for (loop = 0; loop <= 255; loop++)  
Regards
