Rick's b.log - 2015/07/27
Okay. One of the things I've been busy with is DeskLib, a nice library that offers an event-driven way to write programs for RISC OS. DeskLib has been around for ages and I used it way back when, after realising how unpleasant the standard RISC OS library was. I wonder, actually, if the standard library wasn't just a tarted-up version of the libraries that Acorn used to write the base applications (!Draw, !Edit, and !Paint)?
There is an existing modern version of the library available at http://www.riscos.info/index.php/DeskLib as part of the GCCSDK, however the part that caught my eye was "The last AOF release, for use with older versions of GCC (3.4.6) and the Castle C/C++ compiler is 2.80" and "The current version is 3.00, and is ELF-only for use with latest versions of GCC". In other words, the library is going to switch to being ELF-only, which means it will only work with GCC. The native RISC OS development suite is no longer going to be a supported product.
Discussing this on the RISC OS forums, Steve Fryatt said:
There's a third point raised - Steve asks if there is really any excuse for developers to "refuse" to build ELF versions of the library. Note the emotive word "refuse". Despite all that I have said, I have absolutely no objection to my version of DeskLib being an ELF library in addition to ALF. Why? Simple. Compatibility. If everybody can use it, that's great. But it does have some technical problems, namely:
So, I'm not ruling out an ELF version of my fork of DeskLib. I'm only saying that it will be ALF for as long as I am maintaining it. If there's an ELF version too, then good. If not - kindly point your finger at GCC itself. If something wishes to rebuke standard platform conventions, then it should expect the difficulties that this may raise.
Oh, and yes - I am aware that ALF does not support dynamic linking, but then neither does ELF under RISC OS at the moment.
I have based my version of DeskLib on the original v2.30 sources (not the GCCSDK version). I had already made a hacky conversion to 26/32 neutral back in 2003, which was then recompiled in 2014 as my applications kept crashing thanks to the earlier (Castle era) compiler's propensity for using unaligned loads by default (dumb decision!).
Now, understand this: DeskLib was originally written in the RISC OS 2 days, with various modifications for the enhanced features of RISC OS 3. That would make it, likewise, a quarter of a century old. Things don't tend to change much in RISC OS - the latest version of the Desktop looks more or less the same as it always has, only now we can have nicer looking outline fonts in the windows instead of pug-ugly VDU text.
The first task was to revamp all of the 32bitted assembler code to use entry and exit macros wherever possible - this permits the complexities of saving and restoring processor state to be hidden away in one single place. Better yet, some clever use of objasm macros meant I could often enter a function telling it what registers I wanted saved, and then leave simply by calling the "EXIT" macro, which would restore the appropriate registers. Additionally, these macros do different things depending on whether you're building a 32 bit target or a 26 bit target. The end result is that the code can be nice and clean and built for the appropriate target without grief (although I'm concentrating on 32 bit (RISC OS 5) at the moment):
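To give a flavour of the idea, here is a hypothetical sketch of such a pair of macros - these are not the actual DeskLib macros, just an illustration of the technique using objasm's conditional assembly:

```asm
        ; Hypothetical ENTRY/EXIT macros - an illustration only, NOT
        ; the actual DeskLib code. {CONFIG} selects the build target.
        MACRO
$label  ENTRY   $reglist
$label  ROUT
        STMFD   sp!, {$reglist, lr}     ; stack the requested registers
        MEND                            ; plus the return address

        MACRO
$label  EXIT    $reglist
        [ {CONFIG} = 32
        LDMFD   sp!, {$reglist, pc}     ; 32 bit: flags live in the CPSR
        |
        LDMFD   sp!, {$reglist, pc}^    ; 26 bit: ^ restores the flags
        ]                               ; held in the PC word
        MEND
```

A real version would also have to cope with an empty register list, conditional returns, and suchlike, but the point is that the 26/32 bit difference is expressed exactly once, not in every function.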
The old code had quite the tendency to use STMFD R13!,{R14} - a store-multiple instruction to stack a single register.
Sometimes, however, it was better just to rub the eyeballs and do the minimum necessary. Any more would hurt!
Back to the timings. A number of routines used the stack for register transfer. Fair enough, you could transfer multiple registers quickly as shown in the above example, but there are better ways. Consider the following code:
Here's the result:
The code can be optimised more, as follows:
For what it is worth, here's a breakdown A440 (8MHz ARM2) style:
That painful code? I gave it a whirl:
Update: I was reading the ARM System Developer's Guide on the way into town and noting the instruction timings for more recent processors. So I have inserted this information here.
Thanks to the cache and MMU in, well, every processor after the ARM2 (and definitely the ARM3), we can break the association between processor speed and memory speed. The S and N cycles relate to memory access times. While this will still have an effect, its effect will be mitigated by the cache preloading data, and so on, plus memory is a lot faster these days, so we can start to think of instructions as taking a specific number of cycles - as illustrated by the various timing diagrams above. The older processors didn't have the clever ability to dual-wield pipelines to get more done in less time, but they tried to be more efficient as processors. The thing is, the actual implementation varies, sometimes quite a bit.
Here's a little chart:
Where:
Please also do not take this chart as gospel for working out your own programs. You must consult the datasheet relevant to the processor you wish to optimise for - for example consider the LDM instruction of the ARM11. The chart above is a simplified version. Here's the truth:
Finally, I shall leave you with two chunks of code. The first is DeskLib showing its age by jumping through extraordinary hoops to save and restore the FP context around calls to Wimp_Poll.
Well... That was nerdy, wasn't it?
So, DeskLib is coming along. Slowly, yes, but I'm dragging it kicking and screaming into the 21st century. I have put up a build of the new version here, as well as a big list of my plans and the various changes made along the way.
I feel like I ought to end by saying: Yoroshiku Onegaishimasu!!!!!!!! ☺
I'm still here!
No blog updates in a long time - that's because I have been busy elsewhere. Not to mention feeling tired a lot for no explicable reason. I guess it's the season - but I'm on holiday now - wheee!
The key point here is that having to maintain RISC OS-specific back-end tools for a compiler that's already perfectly capable of generating ARM code is massively wasteful in terms of developer resource when all that you actually need is a bunch of off-the-shelf tools that work with the standard ELF format and an unchanging elf2aif tool to convert the end result into something more familiar to RISC OS.
My reply? Well, first of all, I do not buy the "RISC OS specific back end" argument. The GCC compiler targets a hell of a lot of different platforms. There is little in common between Linux and Windows aside from probably running on x86 hardware - yet the compiler manages. With that in mind, either the compiler is horribly written, or it is designed so that different back ends can be slotted in (and given the frequent use of cross compiling - ARM code built on an x86 box - I would say that the back ends don't even have to match the host platform). With that in mind, surely the back end - once written - could mostly be recycled? I don't believe that this stuff is written over and over again for each iteration of GCC. That said - the choice of library is surely a problem for the linker, not the compiler? I would have thought that the linker would change less frequently...
The only real downside is the existence of libraries which are only distributed in ALF format (something that Rick seems keen to perpetuate with his new DeskLib). In practice I've never really found this to be a problem in around 15 years of using GCC; if you're using the DDE libraries, you're probably using Norcroft anyway.
I'm not sure that there's any excuse for current developers refusing to build ELF versions of their libraries. The bigger problem is actually going the other way: a GCC-using developer would need to fork out for the DDE to be able to generate ALF versions of any libraries that they wrote.
Next - even though the Norcroft/Castle/ROOL compiler is a commercial product - it is the proper native tool suite for RISC OS and it uses file types that are documented in the PRMs (the back of book four). This stuff has been standardised in print for the past 23 years, with the aim of keeping the tools compatible with one another. Nick Roberts wrote ASM (note - probably the last framed site in existence, click on the Misc link), a free (and rather nice) assembler which works with the Norcroft suite and/or alternative linkers (drlink in the 26 bit days) precisely because all of these file formats were described and interoperable.
Now, I do not really care much about whether or not GCC chooses to use ELF as an intermediate stage to producing a valid RISC OS program. What concerns me is that newer versions of GCC are dropping support for AOF/ALF files - a decision to intentionally make themselves incompatible with the standard RISC OS tools. Fair enough, it's their decision to make, however personally I feel that the decision to migrate DeskLib - arguably the best library RISC OS has - to be GCC-only is not one that I can agree with. A free compiler suite? Great. A good library to go with it? Great. Incompatible with the Norcroft suite? Uhh... hang on... not so great.
Another thing to be aware of is that a lot is made of various attributes of ELF such as the fact that the executable version contains the program size and doesn't need all that WimpSlot rubbish. Well, surprise! So does AIF. The Read-only area size plus the Read-write area size plus the zero-init size plus debug size is how much memory the application requires not including malloc() etc (but then ELF likely can't say about that either). The fact that the operating system ignores this information is absolutely no reason to denigrate AIF in favour of ELF - the information has been present for over a quarter century, it isn't AIF's fault if older versions of the OS couldn't be bothered to look at it (RISC OS 5 does - CheckAIFMemoryLimit in FileSwitch.s.FSControl).
I used it myself in my own programs, and that was as far as it was going to go. Until I saw that the other one was moving over to being ELF-only.
STMFD R13!,{R14}
which uses the store-multiple to store a single value. Not a big deal on older machines, this and a single register store both take 2N cycles (about 500ns on an ARM2 at 8MHz with the A440's memory).
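On modern cores, though, the single-register push and pop are better written with plain single-register transfers - something like this (illustrative only):

```asm
        STR     R14, [R13, #-4]!        ; push lr (pre-index, writeback)
        ; ... body of function ...
        LDR     PC, [R13], #4           ; pop return address straight
                                        ; into pc (post-index)
```

The addressing modes do the stack adjustment for free, and there is no multi-cycle load/store-multiple to block the pipeline.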
On the Cortex-A8 (as in the Beagle xM), three STRs of a single register will take 5 cycles, while three STMs of a single register will take 6 (STM takes at least two cycles). It is only a small thing, but it is worth optimising as it can add up, especially on cores like the Cortex-A8 that run dual pipelines and can execute instructions in parallel where possible.
STMFD R13!, {R0-R1, R14}
LDMFD R13!, {R1-R2}
MOV R0, #512
ADD R0, R0, #33
BL &8000
MOVVC R0, #0
LDMFD R13!, {PC}
The BL is supposed to be a SWI call, but I can't get !A8time to recognise it either as SWI or the UAL SVC, so I dropped in a BL instead. We just want to know how long this code takes.
Cycle Pipeline 0 Pipeline 1
================================================================================
1 STMFD sp!,{r0-r1,lr} blocked during multi-cycle op
2 STM (cycle 2) need pipeline 0 for multi-cycle op
3 LDMFD sp!,{r1-r2} blocked during multi-cycle op
4 LDM (cycle 2) MOV r0,#512
5 ADD r0,r0,#33 need pipeline 0 for next op
6 BL &8000 MOVVC r0,#0
7 LDMFD sp!,{pc} blocked during multi-cycle op
8 LDM (cycle 2) blocked during multi-cycle op
9 LDM (cycle 3) blocked during multi-cycle op
10 LDM (cycle 4)
In honesty, it probably takes 11 cycles. I can't imagine pipeline 1 honestly doing anything while either a BL (or a SWI) is executed as these will branch off to other code. The big surprise is how many cycles are needed to pull the stack and write it to PC.
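Speaking of which: if a function doesn't need to stack anything at all (a leaf function that makes no calls of its own), the costly load-to-PC can be avoided entirely by leaving the return address in lr. A trivial sketch, not from DeskLib - the Cortex-A8's return stack should predict this kind of return:

```asm
leaf_func
        MOV     r0, #0          ; do the (trivial) work
        MOV     pc, lr          ; return without touching memory
```

Of course, the moment the function BLs to something else, lr gets overwritten and must be stacked - which is exactly the case in the code above.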
STR R14, [R13, #-4]!
MOV R2, R1
MOV R1, R0
MOV R0, #512
ADD R0, R0, #33
BL &8000
MOVVC R0, #0
LDR PC, [R13], #4
which provides us with the following timings:
Cycle Pipeline 0 Pipeline 1
================================================================================
1 STR lr,[sp,#-4]! MOV r2,r1
2 MOV r1,r0 MOV r0,#512
3 ADD r0,r0,#33 need pipeline 0 for next op
4 BL &8000 MOVVC r0,#0
5 LDR pc,[sp],#4 blocked during multi-cycle op
6 LDR (cycle 2)
11 down to 7 might not seem like much, but it is about 36% less - over a third. Expand this to account for all of the functions and all of the times that they may be called... it's a lot.
STMFD R13!, {R0-R1, R14} ; 2S+2N 750ns
LDMFD R13!, {R1-R2} ; 2S+1N+1I 625ns
MOV R0, #512 ; 1S 125ns
ADD R0, R0, #33 ; 1S 125ns
BL &8000 ; 2S+1N 500ns
MOVVC R0, #0 ; 1S 125ns
LDMFD R13!, {PC} ; 2S+2N+1I 875ns
= 3125ns
As opposed to:
STR R14, [R13, #-4]! ; 2N 250ns
MOV R2, R1 ; 1S 125ns
MOV R1, R0 ; 1S 125ns
MOV R0, #512 ; 1S 125ns
ADD R0, R0, #33 ; 1S 125ns
BL &8000 ; 2S+1N 500ns
MOVVC R0, #0 ; 1S 125ns
LDR PC, [R13], #4 ; 2S+2N+1I 875ns
= 2250ns
Not quite the third off that we saw on the Cortex-A8, but with only a single pipeline we're still looking at a saving of a little over a quarter, simply by writing the same thing in a more optimal manner.
Note that BL and SWI both take 2S+1N to execute, plus whatever time is necessary to do the task. We're only interested in the code right in front of us.
Cycle Pipeline 0 Pipeline 1
================================================================================
1 MOV r12,sp need pipeline 0 for multi-cycle op
2 STMFD sp!,{r0-r7,lr} blocked during multi-cycle op
3 STM (cycle 2) blocked during multi-cycle op
4 STM (cycle 3) blocked during multi-cycle op
5 STM (cycle 4) blocked during multi-cycle op
6 STM (cycle 5) need pipeline 0 for multi-cycle op
7 LDMFD sp!,{r1-r4} blocked during multi-cycle op
8 LDM (cycle 2) blocked during multi-cycle op
9 LDM (cycle 3) need pipeline 0 for multi-cycle op
10 LDMFD r12!,{r5-r7} blocked during multi-cycle op
11 LDM (cycle 2) can't dual-issue on final cycle
12 MOV r0,#16 output conflict
13 ADD r0,r0,#256 need pipeline 0 for next op
14 BL &8000 LDRVC r12,[r12,#0]
15 wait for r12 wait for r12
16 wait for r12 wait for r12
17 STRVC r2,[r12,#0] MOVVC r0,#0
18 LDMFD sp!,{r4-r7,pc} blocked during multi-cycle op
19 LDM (cycle 2) blocked during multi-cycle op
20 LDM (cycle 3) blocked during multi-cycle op
21 LDM (cycle 4) blocked during multi-cycle op
22 LDM (cycle 5) can't dual-issue on final cycle
You don't want to try the maths for ARM2 style timings.
Operation    ARM7TDMI    StrongARM   ARM9E     ARM10E         XScale    ARM11E          Cortex-A8
================================================================================
Family       v4T         v4          v5TE(J)   v5TE(J)        v5TEJ     v6J             v7-
Pipeline     5           5           5         5+prediction   7         8+3predictors   13(dual)+prediction
RISC OS      RiscPC era  RiscPC era  -         -              Iyonix    Raspberry Pi    Beagle xM
             (A7000)
MOV,ADD,...  1           1           1         1              1         1               1 *
LDR          3           1           1         1              1         1-2             1 *
LDR to PC    5           4           5         6              8         4-9             2 *
STR          2           1           1         1              1         1-2             1 *
LDM          2+N         N(>=2)      N         1+(Na/2)       2+N       1+(Na/2)        (Na/2)(>=2) *
LDM to PC    4+N         3+N         4+N       1+(Na/2)       7+N       4-9+(Na/2)      1+(Na/2) *
STM          1+N         N           N         1+(Na/2)       2+N       1+(Na/2)        (Na/2)(>=2) *
B/BL         3           2           3         0-2(?+4?)      1(?+4?)   0-?             1 *
SWI          3           ???         3         ???            6         8               ??? *
It can be seen that restoring PC, mispredictions, and SWI calls are increasingly costly as pipeline sizes increase - up until the massive redesign of the Cortex family. RiscPC era machines could process a SWI in three cycles, the Pi requires eight; the Beagle xM? Potentially one but the Cortex-A8 TRM doesn't actually say!
However, do not lose sight of the fact that 3 cycles at 40MHz will be somewhat slower than 8 cycles at 800MHz!
If not loading PC, it takes one cycle and you can't start another memory access for (N+a-1)/2 cycles and the loaded register will not be available for (N+a+3)/2 cycles (where 'a' is bit 2 of the address).
If loading PC from the stack (R13), it takes 4 cycles; plus 5 extra if the internal return stack is empty or mispredicts; plus (N+a)/2 cycles; and the register will not be available for (N+a+5)/2 cycles.
If loading PC not from the stack (!=R13), it takes 8 cycles; plus (N+a)/2 cycles; and the register will not be available for (N+a+5)/2 cycles.
There are many more caveats, such as shifted offsets and the like which affect timings, that have been omitted from this table.
Wimp_Poll (even though most applications probably didn't require this). Ready? Here we go...
PREAMBLE
STARTCODE Wimp_CorruptFPStateOnPoll
;
LDR a1, [pc, #L00001c-.-8]
MOV a2, #0
STR a2, [a1, #0]
MOV pc, lr
L00001c
DCD |x$dataseg|
IMPORT |x$stack_overflow|
IMPORT |_kernel_fpavailable|
;IMPORT fp_status_save
IMPORT |_printf|
;
;
STARTCODE Wimp_Poll3
;
MOV ip, sp
STMFD sp!, {v1, v2, v3, fp, ip, lr, pc} ; APCS stack frame - so ignore objasm moaning
SUB fp, ip, #4
CMPS sp, sl
BLLT |x$stack_overflow|
;
; Save the parameters.
;
MOV v3, a3
MOV v2, a1
MOV v1, a2
;
; Check whether to save the FP status.
;
LDR a1, [pc, #L00001c-.-8]
LDR a1, [a1, #0]
CMPS a1, #0
BLNE |_kernel_fpavailable|
CMPNES a1, #0 ; ## Conditional following a BL, need
BLNE fp_status_save ; ## to check this still works on APCS-32.
;
; Save a flag indicating whether we saved or not.
; Note that the use of lr as a temporary variable precludes this
; code running in SVC mode.
;
MOV lr, a1
;
; Do the poll.
;
MOV a1, v2
ADD a2, v1, #4
MOV a4, v3
SWI SWI_Wimp_Poll + XOS_Bit
SUB a2, a2, #4
STR a1, [a2, #0]
MOVVC a1, #0
CMP lr, #0
BLNE fp_status_restore
[ {CONFIG}=32
LDMDB fp, {v1, v2, v3, fp, sp, pc} ; APCS stack frame, ignore objasm
|
LDMDB fp, {v1, v2, v3, fp, sp, pc}^
]
STARTCODE Wimp_PollIdle3
;
MOV ip, sp
[ {CONFIG}=32
STMFD sp!, {v1 - v4, fp, ip, lr, pc} ; APCS stack frame, ignore objasm
|
STMFD sp!, {v1 - v4, fp, ip, lr, pc}^
]
SUB fp, ip, #4
CMPS sp, sl
BLLT |x$stack_overflow|
;
; Save the parameters.
;
MOV v2, a1
MOV v1, a2
MOV v3, a3
MOV v4, a4
;
; Check whether to save the fp status.
;
LDR a1, [pc, #L00001c-.-8]
LDR a1, [a1, #0]
CMPS a1, #0
BLNE |_kernel_fpavailable|
CMPNES a1, #0
BLNE fp_status_save
MOV lr, a1
MOV a1, v2
ADD a2, v1, #4
MOV a3, v3
MOV a4, v4
SWI SWI_Wimp_PollIdle + XOS_Bit
CMP lr, #0
BLNE fp_status_restore
SUB a2, a2, #4
STR a1, [a2, #0]
MOVVC a1, #0
[ {CONFIG}=32
LDMDB fp, {v1-v4, fp, sp, pc} ; APCS return, ignore objasm
|
LDMDB fp, {v1-v4, fp, sp, pc}^
]
STARTCODE Wimp_SaveFPStateOnPoll
;
LDR a1, [pc, #L00001c-.-8]
MOV a2, #1
STR a2, [a1, #0]
[ {CONFIG}=32
MOV pc, lr
|
MOVS pc, lr
]
fp_status_restore
MOV v1, #0
WFS v1
LDFE f4, [sp, #0]
LDFE f5, [sp, #12]
LDFE f6, [sp, #24]
LDFE f7, [sp, #36]
ADD sp, sp, #48
LDR v1, [sp], #4
;LDMIA sp!, {v1}
WFS v1
MOV pc, lr
fp_status_save
RFS a2
STR a2, [sp, #-4]!
;STMDB sp!, {a2}
MOV a2, #0
WFS a2
SUB sp, sp, #48
STFE f4, [sp, #0]
STFE f5, [sp, #12]
STFE f6, [sp, #24]
STFE f7, [sp, #36]
[ {CONFIG}=32
MOV pc, lr
|
MOVS pc, lr
]
AREA |C$$data|
|x$dataseg|
save_fp
DCD &00000000
And now, since I have officially declared RISC OS 2 as no longer being supported (and quite likely RISC OS 3.10 too), I can take advantage of the fact that the Wimp has been able to do this for a quarter of a century. As such, here is my rewritten code. Spot the difference. It's a real WTF moment...
PREAMBLE
STARTCODE Wimp_CorruptFPStateOnPoll
; Does nothing - FP state is always saved - by the Wimp too...
MOV pc, lr
STARTCODE Wimp_SaveFPStateOnPoll
; Does nothing - FP state is always saved - by the Wimp too...
MOV pc, lr
STARTCODE Wimp_Poll3
; greatly simplified poll routine
; extern os_error *Wimp_Poll3(event_pollmask mask, event_pollblock *event, void *pollword);
ORR a1, a1, #1<<24 ; FP state always saved
ADD a2, a2, #4
MOV a4, a3 ; is this actually used?
SWI SWI_Wimp_Poll + XOS_Bit
STRVC a1, [a2, #-4]
MOVVC a1, #0
[ {CONFIG}=32
MOV pc, lr
|
MOVS pc, lr
]
STARTCODE Wimp_PollIdle3
; greatly simplified poll idle routine
; extern os_error *Wimp_PollIdle3(event_pollmask mask, event_pollblock *event, int earliest, void *pollword);
ORR a1, a1, #1<<24 ; FP state always saved
ADD a2, a2, #4
SWI SWI_Wimp_PollIdle + XOS_Bit
STRVC a1, [a2, #-4]
MOVVC a1, #0
[ {CONFIG}=32
MOV pc, lr
|
MOVS pc, lr
]
I hope that there are still RISC OS programmers using DeskLib, and that they will consider using my version on the new generation of machines.
VinceH, 27th July 2015, 20:16
Nick's isn't the last remaining Frames-based site - it isn't even the last one for RISC OS; two others that spring to mind are R-Comp and Archive Magazine!

Gavin Wraith, 28th July 2015, 21:55
"in a more optimal manner"
Ouch! I think you mean "better".
"on the page linked at the top"
?? What link ?? Using NetSurf (3.4 #2865). I cannot see any links at the top.
Rick, 29th July 2015, 00:17
Well spotted! I forgot my own links. DUH! They have been added.
As for "more optimal manner", no, I meant exactly those three words. "Better" works for the layman, but is ambiguous as to WHY it is better. More optimal suggests that the goal here is to make the code do the same thing in fewer cycles.
It looks like you're using Textile markup (!), so I've swapped the _underscores_ for italics for you, since I had WinSCP running and, well, it just looks better doesn't it? ;-)
© 2015 Rick Murray