Memory schemes
Memory is a resource that needs careful management. It is expensive (£/Mb is much higher
for memory than for conventional harddisc storage). A good system will offer flexible facilities
trading off speed for functionality.
You need memory because it is fast. It is rarely as fast as the processor these days, but it is
faster than a harddisc. We need it fast, and we need it big, so we can hold the large programs
and large amounts of data that seem to be around these days. It boggles the mind that a commercial
mainframe once did accounts and stuff with a mere 4K of memory.
Typically, there will be three or four, possibly five, kinds of storage in the computer.
Consider:
.----------------.    .----------------.
|   OS in ROM    |    | Device drivers |
|                |    |     in ROM     |
|----------------|    |----------------|
|                |    |                |
|      Your      |    |      Your      |
|  application   |    |  application   |
|                |    |----------------|
|----------------|    |                |
|System workspace|    |   OS in RAM    |
'----------------'    '----------------'

The first example is similar to the layout of the BBC microcomputer. The second is not that different to a basic MS-DOS system: the OS is loaded low in memory, the BIOS is mapped in at the top, and the application sits in the middle.
To be honest, the first example is used a lot under RISC OS as well. It is exactly what a
standard application is supposed to believe: the OS uses page zero (&0000 - &7FFF) for internal
housekeeping, your application begins at &8000, and the hardware/OS sit way up in the ether
at &3800000.
Memory management under RISC OS is more complex, but this is how a typical application will see
things.
When the memory is organised in this way, only one application can be running. When the user enters a command, if it is an application then that application is copied from disc into memory, then it is executed. When the application is done with, the operating system reappears, waiting for you to give it something else to do.
Memory is typically handled as non-contiguous blocks. On an ARM machine, pages are brought
together to fake a contiguous chunk of memory beginning at &8000. Anybody who has tried an
address translation within their allocated memory will know that it is near impossible to
get an actual physical memory address out of the OS.
The following program demonstrates this:
END = &10000 : REM Constrain slot to 32K
DIM willow% 16
SYS "Wimp_SlotSize", -1, -1 TO slot%
SYS "OS_ReadMemMapInfo" TO page%
PRINT "Using "+STR$(slot% / page%)+" pages, each page being "+STR$(page%)+" bytes."
PRINT "Pages used: ";
more% = slot% / page%
FOR loop% = 0 TO (more% - 1)
  willow%!0 = 0
  willow%!4 = &8000 + (loop% * page%)
  willow%!8 = 0
  willow%!12 = -1
  SYS "OS_FindMemMapEntries", willow%
  IF loop% > 0 THEN PRINT ", ";
  PRINT STR$(willow%!0);
NEXT
PRINT
END

This outputs something similar to:

Using 8 pages, each page being 4096 bytes.
Pages used: 2555, 2340, 2683, 2682, 2681, 2680, 2679, 2678
RISC OS keeps everything in memory: applications are loaded in full, and each is 'paged in' when it is its turn to run by remapping the page tables; the other tasks are mapped out at the same time.
Windows/Unix systems load applications into memory, supported by a system called 'virtual memory'
which dumps unused pages to disc in order to free system memory for applications that need it.
I am not sure how Windows organises its memory - whether it does it in a style similar to RISC OS
(ie, remapping each task to start from a specific address) or whether each application is just told 'you are here'.
Virtual memory is useful, as you can fit a 32Mb program into 16Mb of memory if you are careful
how you load it, and swap out old parts for new parts as necessary.
Some systems use a lazy form of paging, better known as demand paging. In this case, only the
first page of the application is loaded into memory when execution starts. As more of the
application is executed, the operating system loads in the parts required.
By contrast, under RISC OS an application must be loaded in its entirety before it can run.
Consider loading, well, practically anything off of a floppy disc. It takes time.
When the processor is instructed to jump to &8000 to begin executing an application, it passes the address &8000 to the MMU. This translates the address into the correct real address and outputs this on the address lines, say &12FC00. The processor is not aware of this, the application is not aware of this, the computer user is not aware of this.
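To make the translation step concrete, here is a minimal sketch of the lookup in BBC BASIC. It assumes a simple one-level page table and a 4K page size; the table contents (logical page 8 living at physical page &12F) are invented purely for illustration, not taken from a real machine.

page_size% = &1000                         : REM 4K pages
DIM table%(255)                            : REM logical page number -> physical page number
table%(8) = &12F                           : REM invented mapping: logical &8000 -> physical &12F000

PRINT "Logical &8000 -> physical &"+STR$~(FNtranslate(&8000))
PRINT "Logical &8F04 -> physical &"+STR$~(FNtranslate(&8F04))
END

DEF FNtranslate(logical%)
LOCAL page%, offset%
page%   = logical% DIV page_size%          : REM which page the address falls in
offset% = logical% MOD page_size%          : REM position within that page
= (table%(page%) * page_size%) + offset%

In a real MMU all of this happens in hardware on every single memory access, which is why the format and size of the table matter so much.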
So we can take this one stage further by mapping onwards into memory that does not exist at all. In this case, the MMU will hiccup and say "Oi! You! No!" and the operating system will be called in a panic (this is correctly known as a "page fault"). The operating system will be calm and collected and think, "Ah, virtual memory". A little-used page of real memory will be shoved out to disc, then the page that the MMU was trying to find will be loaded in place of the page we just got rid of. The memory map will be updated accordingly, then control will be handed back to the user application at the exact point the page fault occurred. The application, knowing nothing of all this palaver, will perform that instruction again, only this time the MMU will (happily?) output the correct address to the memory system, and all will continue.
So the MMU takes an address, looks it up in the page table, and spits out the correct address.
Let's do some maths. We'll assume a 4K page size (a la RISC OS on a RiscPC). A 32bit address
space divided into 4K pages gives roughly a million pages (2^32 / 2^12 = 1,048,576). With a
million pages, you'll need a million entries. In the ARM MMU, each entry takes 7 words. So we
are looking at seven megabytes just to index our memory.
It gets better. Every single memory reference will be passed through the MMU. So we'll
want it to operate in nanoseconds. Faster, if possible.
In reality, it is somewhat easier, as most typical machines don't have enough memory to fill the
entire addressing space; indeed, many are unlikely to get close for technical reasons (the RiscPC
can have 258Mb maximum RAM, or 514Mb with Kinetic - the extra 2Mb is the VRAM). Even so, the page
tables will get large.
So there are three options:
So far we have figured on the hardware doing all of this, as in the ARM processor, with the aid of a TLB (Translation Lookaside Buffer) - a small cache, inside the MMU, of the most recently used page table entries. Some RISC processors (such as the Alpha and the MIPS) will pass the problem of a TLB miss to the operating system. This may allow the OS to use some intelligence to pre-load certain pages into the TLB.
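As a rough illustration of what a TLB does, here is a toy software version in BBC BASIC: a handful of cached translations are searched first, and only on a miss is the full page table consulted. The eight-entry size, the round-robin replacement and the table contents are all arbitrary choices for this sketch, not a description of any real MMU.

tlb_size% = 8
DIM tlb_page%(tlb_size%-1), tlb_frame%(tlb_size%-1)
DIM table%(255)                            : REM the full page table (invented contents)
FOR i% = 0 TO tlb_size%-1 : tlb_page%(i%) = -1 : NEXT
victim% = 0
table%(8) = &12F

PRINT "Frame &"+STR$~(FNlookup(8))+" (miss - table walked, TLB loaded)"
PRINT "Frame &"+STR$~(FNlookup(8))+" (hit - answered from the TLB)"
END

DEF FNlookup(page%)
LOCAL i%, found%
found% = -1
FOR i% = 0 TO tlb_size%-1
  IF tlb_page%(i%) = page% THEN found% = i%
NEXT
IF found% >= 0 THEN = tlb_frame%(found%)
REM TLB miss: walk the page table and cache the result (round robin)
tlb_page%(victim%)  = page%
tlb_frame%(victim%) = table%(page%)
victim% = (victim% + 1) MOD tlb_size%
= table%(page%)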
As with harddisc LFAUs, what you need is a sensible trade-off between page size and page table
size. You could reduce the wastage in memory by making pages small, say 256 bytes, but then you
would need a lot of memory to store the page table - and a bigger page table is slower to scan
through. Or you could have 64K pages, which make the page table small but can waste huge amounts
of memory.
By way of example, a 32K program would require eight 4K pages, or sixty-four 512 byte pages. If
your system remaps memory when shuffling pages around, it is quicker to move a smaller number of
large pages than a larger number of small pages.
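To put rough numbers on the trade-off, here is a small BBC BASIC procedure that works out how many pages a program of a given size needs and how much of the final page goes to waste. The sizes fed to it below are made-up examples, not measurements.

PROCpages(32 * 1024, 4096)                 : REM 8 pages, nothing wasted
PROCpages(33000, 4096)                     : REM 9 pages, 3864 bytes wasted
PROCpages(33000, 512)                      : REM 65 pages, only 280 bytes wasted
END

DEF PROCpages(program%, page%)
LOCAL count%, waste%
count% = (program% + page% - 1) DIV page%  : REM round up to whole pages
waste% = (count% * page%) - program%       : REM unused space in the final page
PRINT STR$(program%)+" bytes in "+STR$(page%)+" byte pages: ";
PRINT STR$(count%)+" pages, "+STR$(waste%)+" bytes wasted"
ENDPROC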
The MEMC in older RISC OS machines had a fixed number of entries in its page table, so the page size depended upon how much memory was fitted:
  Memory | Page size
  -------+----------
   0.5Mb |    8K
     1Mb |    8K
     2Mb |   16K
     4Mb |   32K
Most commercial systems use page sizes of the order of 512 bytes to 64K.
The later ARM processors (ARM6 onwards) and the Intel Pentium both use a 4K page size.
Not Recently Used
This requires two bits to be reserved in each page table entry: one set when the page is
referenced, and one set when it is written to. Upon each access, the paging hardware (and it
must be done in hardware, for speed) will set the bits as necessary. Then, at a fixed interval -
typically on a clock interrupt - the operating system clears the referenced bits. This allows you
to track recent page accesses, so when flushing out a page you can spot those that have not
recently been referenced or written. NRU removes a page at random from among those, preferring
pages that have been neither referenced nor written. While it is not the best way of deciding
which pages to remove, it is simple and gives reasonably good results.
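Here is a rough BBC BASIC sketch of the victim selection, assuming one referenced flag and one written-to flag per page. The page states are invented, and for brevity it takes the first page in the best class rather than choosing at random within it, as proper NRU would.

pages% = 8
DIM ref%(pages%-1), dirty%(pages%-1)
REM Invented state: pages 3 and 5 were referenced since the last tick,
REM and page 5 was also written to.
ref%(3) = 1 : ref%(5) = 1 : dirty%(5) = 1
PRINT "Evict page ";FNnru_victim           : REM page 0: neither bit set
REM On the clock interrupt the OS clears the referenced bits again:
FOR i% = 0 TO pages%-1 : ref%(i%) = 0 : NEXT
END

DEF FNnru_victim
REM Class 0 = neither referenced nor written ... class 3 = both.
LOCAL i%, class%, best%, bestclass%
best% = 0 : bestclass% = 4
FOR i% = 0 TO pages%-1
  class% = ref%(i%) * 2 + dirty%(i%)
  IF class% < bestclass% THEN best% = i% : bestclass% = class%
NEXT
= best%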
First-In First-Out
It is hoped you are familiar with the concept of FIFO, from buffering and the like. If you are
not, consider the lame analogy of the hose pipe in which the first water in will be the first
water to come out the other end. It is rarely used, I'll leave the whys and where-fores as an
exercise for the bemused reader. :-)
Second Chance
A simple modification to the FIFO arrangement is to look at the referenced bit of the oldest
page: if it is zero then we know the page is not in current use and it can be thrown out. If the
bit is set, the page is shifted to the end of the page list as if it were a new arrival, and the
search continues. What we are doing here is looking for a page unused since the last period
(clock tick?). If by some miracle ALL the pages are current and active, then Second Chance simply
reverts to FIFO.
Clock
Although Second Chance is good, all that page shuffling is inefficient, so the pages are instead kept in a circular list (ie, a clock). If the page being examined is in use, we move on and look at the next page. With no concept of the start and end of the list, we just keep going until we come to a usable page.
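A minimal sketch of the clock sweep in BBC BASIC, assuming just a referenced bit per page; the initial page states are invented. The 'hand' clears referenced bits as it passes, so a busy page gets a second chance rather than being evicted.

pages% = 8
DIM ref%(pages%-1)
hand% = 0
ref%(0) = 1 : ref%(1) = 1                  : REM invented: the first two pages are in use
PRINT "Evict page ";FNclock_victim         : REM skips (and clears) 0 and 1, settles on 2
END

DEF FNclock_victim
WHILE ref%(hand%) = 1
  ref%(hand%) = 0                          : REM in use - clear the bit, second chance
  hand% = (hand% + 1) MOD pages%
ENDWHILE
= hand%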
Least Recently Used
LRU is possible, but it isn't cheap. You maintain a list of all the pages, sorted from the most
recently used at the front to the least recently used at the back. When you need a page, you pull
the last entry and use it. For speed reasons this is only really feasible in hardware, as the
list must be updated on every memory access.
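The idea can be shown with a few lines of BBC BASIC, assuming a small invented set of pages: every access moves a page to the front of an ordering, and the victim is whatever has sunk to the back. Having to do this on every memory access is exactly what makes LRU so expensive in software.

pages% = 4
DIM order%(pages%-1)
FOR i% = 0 TO pages%-1 : order%(i%) = i% : NEXT

PROCtouch(2) : PROCtouch(0) : PROCtouch(3)
PRINT "Evict page ";order%(pages%-1)       : REM page 1, the only one not touched
END

DEF PROCtouch(page%)
REM Move the accessed page to the front of the ordering.
LOCAL i%, pos%
pos% = -1
FOR i% = 0 TO pages%-1
  IF order%(i%) = page% THEN pos% = i%
NEXT
IF pos% > 0 THEN
  FOR i% = pos% TO 1 STEP -1 : order%(i%) = order%(i%-1) : NEXT
  order%(0) = page%
ENDIF
ENDPROC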
Not Frequently Used
In an attempt to simulate LRU in software, the operating system scans the available pages on each
clock tick and increments a counter (held in memory, one for each page) for every page whose
referenced bit is set. Unfortunately, it never forgets. Code that was heavily used but is no
longer necessary (such as a rendering core) will keep a high count for quite a while, whereas
code that is not called often but should be all the more responsive, such as redraw code, will
have a lower count and thus stands a chance of being kicked out, even though it is wanted and
the highly-rated renderer no longer is.
But this can be fixed, and the fix emulates LRU quite well. It is called aging. On each tick,
before the referenced bit is added, the count is shifted one bit to the right, so after a number
of ticks the count will fall to zero unless the bit keeps being added. Here you might be wondering
how adding a bit can work if you've just shifted a bit off. The answer is simple: the bit is added
at the leftmost, ie most significant, position.
To make this clearer...

Once upon a time : 0 0 1 0 1 1
Clock tick       : 0 0 0 1 0 1
Clock tick       : 0 0 0 0 1 0
Memory accessed  : 1 0 0 0 0 1
Clock tick       : 0 1 0 0 0 0
Memory accessed  : 1 0 1 0 0 0
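The same shifting can be written as a few lines of BBC BASIC. This is a sketch with a six-bit counter and made-up page states, mirroring the table above; the page with the lowest counter is the one to evict.

pages% = 4
DIM count%(pages%-1), ref%(pages%-1)
count%(0) = %001011                        : REM the 'once upon a time' value above

PROCtick : PRINT "Counter now &"+STR$~(count%(0))   : REM &5  = 000101
PROCtick : PRINT "Counter now &"+STR$~(count%(0))   : REM &2  = 000010
ref%(0) = 1                                : REM the page gets accessed this period
PROCtick : PRINT "Counter now &"+STR$~(count%(0))   : REM &21 = 100001
END

DEF PROCtick
REM One clock tick: shift every counter right, add that page's
REM referenced bit at the top (bit 5 of this six-bit counter),
REM then clear the bit ready for the next period.
LOCAL i%
FOR i% = 0 TO pages%-1
  count%(i%) = (count%(i%) >> 1) OR (ref%(i%) << 5)
  ref%(i%) = 0
NEXT
ENDPROC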
However, it is possible to provide the illusion of running several things at once. In the old days, things happened in the background under interrupt control. Keyboards were scanned, clocks were updated. As computers became more powerful, more stuff happened in the background. Hugo Fiennes wrote a soundtracker player that runs on interrupts, so works in the background. You set it going, it carries on independent of your code.
So people began to think of applying this to applications. After all, most of an application's
time is spent waiting for user input. In fact, the application may easily do sweet sod all for
almost 100% of the time - measured by an event counter in Emily's polling loop, I type around one
character a second while the RiscPC polls a few hundred times a second. That was measured in a
multitasking application, using polling speed as a yardstick; imagine the figures for a
single-tasking program. So the idea was arrived at: we can load several programs into memory,
provide them with some standard facilities and messaging systems, and then let each run for a
predefined duration. When the duration is up, we pass control to the next program. When that has
used its time, we go on to the next program, and so on.
As a brief aside, I wish to point out Schrödinger's cat. A rather cute little moggy, but an
extremely important one. It is physically impossible to measure system polling speed in software,
and pretty difficult to measure it in hardware. You see, the very act of performing your
measurement will affect the results. And you cannot easily 'account' for the time taken to make
your measurements because measuring yourself is subject to the same artefacts as when measuring
other things. You can only say 'to hell with it', and have your program report your polling rate
as being 379 polls/sec, knowing that your measuring code may be eating around 20% of the
available time, and use the figures in a relative form rather than trying to state "My
computer achieves 379 polls every second". While there is no untruth in that, your computer
might do 450 if you weren't so busy watching! You simply can't be JAFO.
...and you need to go to school/college and get bored rigid to find out what relevance any of
this has to your cat. Mine is sitting on my monitor, asleep, blissfully unaware of all these
heavy scientific concepts. She's probably got the right idea...
I don't wish to get into an advocacy war here. My personal preference is co-operative, however I
don't feel that either is the answer; rather, a hybrid of the two could make for a clean system.
The major drawback of CMT is that if an application misbehaves and goes into a never-ending loop,
control never comes back, and the application has to be forcibly killed off.
Niall Douglas wrote a pre-emption system for
RISC OS applications. Surprisingly, you didn't really notice anything much until an application
entered some heavy processing (say, ChangeFSI) at which point life carried right on as normal
while the task which would have stalled the machine for a while chugged away in the background.