Memory schemes
Memory is a resource that needs careful management. It is expensive (£/Mb is much higher
for memory than for conventional harddisc storage). A good system will offer flexible facilities
trading off speed for functionality.
You need memory because it is fast. It is rarely as fast as the processor these days, but it is
faster than a harddisc. We need it fast, and we need it big, so we can hold the large programs
and large amounts of data that seem to be around these days. It boggles the mind that a commercial
mainframe once did accounts and stuff with a mere 4K of memory.
Typically, there will be three or four, possibly five, kinds of storage in the computer.
Consider:
.----------------.    .----------------.
|   OS in ROM    |    | Device drivers |
|                |    |     in ROM     |
|----------------|    |----------------|
|                |    |                |
|      Your      |    |      Your      |
|  application   |    |  application   |
|                |    |----------------|
|----------------|    |                |
|System workspace|    |   OS in RAM    |
'----------------'    '----------------'

The first example is similar to the layout of the BBC microcomputer. The second is not that different to a basic MS-DOS system: the OS is loaded low in memory, the BIOS is mapped in at the top, and the application sits in the middle.
To be honest, the first example is used a lot under RISC OS as well. It is exactly what a
standard application is supposed to believe: the OS uses page zero (&0000 - &7FFF) for internal
housekeeping, your application begins at &8000, and the hardware/OS sit way up in the ether
at &3800000.
Memory management under RISC OS is more complex, but this is how a typical application will see
things.
When the memory is organised in this way, only one application can be running. When the user enters a command, if it is an application then that application is copied from disc into memory, then it is executed. When the application is done with, the operating system reappears, waiting for you to give it something else to do.
Memory is typically handled as non-contiguous blocks. On an ARM machine, pages are brought
together to fake a contiguous chunk of memory beginning at &8000. Anybody who has tried an
address translation within their allocated memory will know that it is near impossible to
get an actual physical memory address out of the OS.
The following program demonstrates this:
END = &10000 : REM Constrain slot to 32K
DIM willow% 16
SYS "Wimp_SlotSize", -1, -1 TO slot%
SYS "OS_ReadMemMapInfo" TO page%
PRINT "Using "+STR$(slot% / page%)+" pages, each page being "+STR$(page%)+" bytes."
PRINT "Pages used: ";
more% = slot% / page%
FOR loop% = 0 TO (more% - 1)
  willow%!0 = 0
  willow%!4 = &8000 + (loop% * page%)
  willow%!8 = 0
  willow%!12 = -1
  SYS "OS_FindMemMapEntries", willow%
  IF loop% > 0 THEN PRINT ", ";
  PRINT STR$(willow%!0);
NEXT
PRINT
END

This outputs something similar to:

Using 8 pages, each page being 4096 bytes.
Pages used: 2555, 2340, 2683, 2682, 2681, 2680, 2679, 2678
RISC OS keeps everything in memory: applications are loaded in full, and each is 'paged in' when it is its turn to run by remapping the page tables; the other tasks are mapped out at the same time.
Windows/Unix systems load applications into memory, supported by a system called 'virtual memory'
which dumps unused pages to disc in order to free system memory for applications that need it.
I am not sure how Windows organises its memory - whether it does it in a style similar to RISC OS
(ie, remapping each task to start from a specific address) or whether each application is just told 'you are here'.
Virtual memory is useful, as you can fit a 32Mb program into 16Mb of memory if you are careful
how you load it, and swap out old parts for new parts as necessary.
Some systems use a lazy form of paging, better known as demand paging. In this case, only the
first page of the application is loaded into memory when execution starts. As more of the
application is executed, the operating system loads in the parts required.
By contrast, under RISC OS an application must be loaded in its entirety before it can run.
Consider loading, well, practically anything off of a floppy disc. It takes time.
When the processor is instructed to jump to &8000 to begin executing an application, it passes the address &8000 to the MMU. This translates the address into the correct real address and outputs this on the address lines, say &12FC00. The processor is not aware of this, the application is not aware of this, the computer user is not aware of this.
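To make the translation step concrete, here is a minimal sketch of the lookup in BBC BASIC. It assumes a simple one-level page table and a 4K page size; the table contents (logical page 8 living at physical page &12F) are invented purely for illustration, not taken from a real machine.

page_size% = &1000                         : REM 4K pages
DIM table%(255)                            : REM logical page number -> physical page number
table%(8) = &12F                           : REM invented mapping: logical &8000 -> physical &12F000

PRINT "Logical &8000 -> physical &"+STR$~(FNtranslate(&8000))
PRINT "Logical &8F04 -> physical &"+STR$~(FNtranslate(&8F04))
END

DEF FNtranslate(logical%)
LOCAL page%, offset%
page%   = logical% DIV page_size%          : REM which page the address falls in
offset% = logical% MOD page_size%          : REM position within that page
= (table%(page%) * page_size%) + offset%

In a real MMU all of this happens in hardware on every single memory access, which is why the format and size of the table matter so much.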
So we can take this one stage further by mapping onwards into memory that does not exist at all. In this case, the MMU will hiccup and say "Oi! You! No!" and the operating system will be called in a panic (this is correctly known as a "page fault"). The operating system will be calm and collected and think, "Ah, virtual memory". A little-used page of real memory will be shoved out to disc, then the page that the MMU was trying to find will be loaded in place of the page we just got rid of. The memory map will be updated accordingly, then control will be handed back to the user application at the exact point the page fault occurred. The application, knowing nothing of all this palaver, will perform that instruction again, only this time the MMU will (happily?) output the correct address to the memory system, and all will continue.
So the MMU takes an address, looks it up in the page table, and spits out the correct address.
Let's do some maths. We'll assume a 4K page size (a la RISC OS on a RiscPC). A 32bit address
space divided into 4K pages gives roughly a million pages (2^32 / 2^12 = 1,048,576). With a
million pages, you'll need a million entries. In the ARM MMU, each entry takes 7 words. So we
are looking at seven megabytes just to index our memory.
It gets better. Every single memory reference will be passed through the MMU. So we'll
want it to operate in nanoseconds. Faster, if possible.
In reality, it is somewhat easier, as most typical machines don't have enough memory to fill the
entire addressing space; indeed, many are unlikely to get close for technical reasons (the RiscPC
can have 258Mb maximum RAM, or 514Mb with Kinetic - the extra 2Mb is the VRAM). Even so, the page
tables will get large.
So there are three options:
So far we have figured on the hardware doing all of this, as in the ARM processor, with the aid of a TLB (Translation Lookaside Buffer) - a small cache, inside the MMU, of the most recently used page table entries. Some RISC processors (such as the Alpha and the MIPS) will pass the problem of a TLB miss to the operating system. This may allow the OS to use some intelligence to pre-load certain pages into the TLB.
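As a rough illustration of what a TLB does, here is a toy software version in BBC BASIC: a handful of cached translations are searched first, and only on a miss is the full page table consulted. The eight-entry size, the round-robin replacement and the table contents are all arbitrary choices for this sketch, not a description of any real MMU.

tlb_size% = 8
DIM tlb_page%(tlb_size%-1), tlb_frame%(tlb_size%-1)
DIM table%(255)                            : REM the full page table (invented contents)
FOR i% = 0 TO tlb_size%-1 : tlb_page%(i%) = -1 : NEXT
victim% = 0
table%(8) = &12F

PRINT "Frame &"+STR$~(FNlookup(8))+" (miss - table walked, TLB loaded)"
PRINT "Frame &"+STR$~(FNlookup(8))+" (hit - answered from the TLB)"
END

DEF FNlookup(page%)
LOCAL i%, found%
found% = -1
FOR i% = 0 TO tlb_size%-1
  IF tlb_page%(i%) = page% THEN found% = i%
NEXT
IF found% >= 0 THEN = tlb_frame%(found%)
REM TLB miss: walk the page table and cache the result (round robin)
tlb_page%(victim%)  = page%
tlb_frame%(victim%) = table%(page%)
victim% = (victim% + 1) MOD tlb_size%
= table%(page%)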
As with harddisc LFAUs, what you need is a sensible trade-off between page size and page table
size. You could reduce the wastage in memory by making pages small, say 256 bytes, but then you
would need a lot of memory to store the page table - and a bigger page table is slower to scan
through. Or you could have 64K pages, which make the page table small but can waste huge amounts
of memory.
By way of example, a 32K program would require eight 4K pages, or sixty-four 512 byte pages. If
your system remaps memory when shuffling pages around, it is quicker to move a smaller number of
large pages than a larger number of small pages.
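To put rough numbers on the trade-off, here is a small BBC BASIC procedure that works out how many pages a program of a given size needs and how much of the final page goes to waste. The sizes fed to it below are made-up examples, not measurements.

PROCpages(32 * 1024, 4096)                 : REM 8 pages, nothing wasted
PROCpages(33000, 4096)                     : REM 9 pages, 3864 bytes wasted
PROCpages(33000, 512)                      : REM 65 pages, only 280 bytes wasted
END

DEF PROCpages(program%, page%)
LOCAL count%, waste%
count% = (program% + page% - 1) DIV page%  : REM round up to whole pages
waste% = (count% * page%) - program%       : REM unused space in the final page
PRINT STR$(program%)+" bytes in "+STR$(page%)+" byte pages: ";
PRINT STR$(count%)+" pages, "+STR$(waste%)+" bytes wasted"
ENDPROC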
The MEMC in older RISC OS machines had a fixed number of entries in its page table, so the page size depended upon how much memory was fitted:
  Memory | Page size
  -------+----------
   0.5Mb |    8K
     1Mb |    8K
     2Mb |   16K
     4Mb |   32K
Most commercial systems use page sizes of the order of 512 bytes to 64K.
The later ARM processors (ARM6 onwards) and the Intel Pentium both use a 4K page size.
Not Recently Used
This requires two bits to be reserved in each page table entry: one set when the page is
referenced, and one set when it is written to. Upon each access, the paging hardware (and it
must be done in hardware, for speed) will set the bits as necessary. Then, at a fixed interval -
typically on a clock interrupt - the operating system clears the referenced bits. This allows you
to track recent page accesses, so when flushing out a page you can spot those that have not
recently been referenced or written. NRU removes a page at random from among those, preferring
pages that have been neither referenced nor written. While it is not the best way of deciding
which pages to remove, it is simple and gives reasonably good results.
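Here is a rough BBC BASIC sketch of the victim selection, assuming one referenced flag and one written-to flag per page. The page states are invented, and for brevity it takes the first page in the best class rather than choosing at random within it, as proper NRU would.

pages% = 8
DIM ref%(pages%-1), dirty%(pages%-1)
REM Invented state: pages 3 and 5 were referenced since the last tick,
REM and page 5 was also written to.
ref%(3) = 1 : ref%(5) = 1 : dirty%(5) = 1
PRINT "Evict page ";FNnru_victim           : REM page 0: neither bit set
REM On the clock interrupt the OS clears the referenced bits again:
FOR i% = 0 TO pages%-1 : ref%(i%) = 0 : NEXT
END

DEF FNnru_victim
REM Class 0 = neither referenced nor written ... class 3 = both.
LOCAL i%, class%, best%, bestclass%
best% = 0 : bestclass% = 4
FOR i% = 0 TO pages%-1
  class% = ref%(i%) * 2 + dirty%(i%)
  IF class% < bestclass% THEN best% = i% : bestclass% = class%
NEXT
= best%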
First-In First-Out
It is hoped you are familiar with the concept of FIFO, from buffering and the like. If you are
not, consider the lame analogy of the hose pipe in which the first water in will be the first
water to come out the other end. It is rarely used, I'll leave the whys and where-fores as an
exercise for the bemused reader. :-)
Second Chance
A simple modification to the FIFO arrangement is to look at the referenced bit of the oldest
page: if it is zero then we know the page is not in current use and it can be thrown out. If the
bit is set, the page is shifted to the end of the page list as if it were a new arrival, and the
search continues. What we are doing here is looking for a page unused since the last period
(clock tick?). If by some miracle ALL the pages are current and active, then Second Chance simply
reverts to FIFO.
Clock
Although Second Chance is good, all that page shuffling is inefficient, so the pages are instead kept in a circular list (ie, a clock). If the page being examined is in use, we move on and look at the next page. With no concept of the start and end of the list, we just keep going until we come to a usable page.
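A minimal sketch of the clock sweep in BBC BASIC, assuming just a referenced bit per page; the initial page states are invented. The 'hand' clears referenced bits as it passes, so a busy page gets a second chance rather than being evicted.

pages% = 8
DIM ref%(pages%-1)
hand% = 0
ref%(0) = 1 : ref%(1) = 1                  : REM invented: the first two pages are in use
PRINT "Evict page ";FNclock_victim         : REM skips (and clears) 0 and 1, settles on 2
END

DEF FNclock_victim
WHILE ref%(hand%) = 1
  ref%(hand%) = 0                          : REM in use - clear the bit, second chance
  hand% = (hand% + 1) MOD pages%
ENDWHILE
= hand%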
Least Recently Used
LRU is possible, but it isn't cheap. You maintain a list of all the pages, sorted from the most
recently used at the front to the least recently used at the back. When you need a page, you pull
the last entry and use it. For speed reasons this is only really feasible in hardware, as the
list must be updated on every memory access.
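The idea can be shown with a few lines of BBC BASIC, assuming a small invented set of pages: every access moves a page to the front of an ordering, and the victim is whatever has sunk to the back. Having to do this on every memory access is exactly what makes LRU so expensive in software.

pages% = 4
DIM order%(pages%-1)
FOR i% = 0 TO pages%-1 : order%(i%) = i% : NEXT

PROCtouch(2) : PROCtouch(0) : PROCtouch(3)
PRINT "Evict page ";order%(pages%-1)       : REM page 1, the only one not touched
END

DEF PROCtouch(page%)
REM Move the accessed page to the front of the ordering.
LOCAL i%, pos%
pos% = -1
FOR i% = 0 TO pages%-1
  IF order%(i%) = page% THEN pos% = i%
NEXT
IF pos% > 0 THEN
  FOR i% = pos% TO 1 STEP -1 : order%(i%) = order%(i%-1) : NEXT
  order%(0) = page%
ENDIF
ENDPROC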
Not Frequently Used
In an attempt to simulate LRU in software, the operating system scans the available pages on each
clock tick and increments a counter (held in memory, one for each page) for every page whose
referenced bit is set. Unfortunately, it never forgets. Code that was heavily used but is no
longer necessary (such as a rendering core) will keep a high count for quite a while, whereas
code that is not called often but should be all the more responsive, such as redraw code, will
have a lower count and thus stands a chance of being kicked out, even though it is wanted and
the highly-rated renderer no longer is.
But this can be fixed, and the fix emulates LRU quite well. It is called aging. On each tick,
before the referenced bit is added, the count is shifted one bit to the right, so after a number
of ticks the count will fall to zero unless the bit keeps being added. Here you might be wondering
how adding a bit can work if you've just shifted a bit off. The answer is simple: the bit is added
at the leftmost, ie most significant, position.
To make this clearer...

Once upon a time : 0 0 1 0 1 1
Clock tick       : 0 0 0 1 0 1
Clock tick       : 0 0 0 0 1 0
Memory accessed  : 1 0 0 0 0 1
Clock tick       : 0 1 0 0 0 0
Memory accessed  : 1 0 1 0 0 0
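The same shifting can be written as a few lines of BBC BASIC. This is a sketch with a six-bit counter and made-up page states, mirroring the table above; the page with the lowest counter is the one to evict.

pages% = 4
DIM count%(pages%-1), ref%(pages%-1)
count%(0) = %001011                        : REM the 'once upon a time' value above

PROCtick : PRINT "Counter now &"+STR$~(count%(0))   : REM &5  = 000101
PROCtick : PRINT "Counter now &"+STR$~(count%(0))   : REM &2  = 000010
ref%(0) = 1                                : REM the page gets accessed this period
PROCtick : PRINT "Counter now &"+STR$~(count%(0))   : REM &21 = 100001
END

DEF PROCtick
REM One clock tick: shift every counter right, add that page's
REM referenced bit at the top (bit 5 of this six-bit counter),
REM then clear the bit ready for the next period.
LOCAL i%
FOR i% = 0 TO pages%-1
  count%(i%) = (count%(i%) >> 1) OR (ref%(i%) << 5)
  ref%(i%) = 0
NEXT
ENDPROC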
However, it is possible to provide the illusion of running several things at once. In the old days, things happened in the background under interrupt control. Keyboards were scanned, clocks were updated. As computers became more powerful, more stuff happened in the background. Hugo Fiennes wrote a soundtracker player that runs on interrupts, so works in the background. You set it going, it carries on independent of your code.
So people began to think of applying this to applications. After all, most of an application's
time is spent waiting for user input. In fact, the application may easily do sweet sod all for
almost 100% of the time - measured by an event counter in Emily's polling loop, I type around one
character a second while the RiscPC polls a few hundred times a second. That was measured in a
multitasking application, using polling speed as a yardstick; imagine the figures for a
single-tasking program. So the idea was arrived at: we can load several programs into memory,
provide them with some standard facilities and messaging systems, and then let each run for a
predefined duration. When the duration is up, we pass control to the next program. When that has
used its time, we go on to the next program, and so on.
As a brief aside, I wish to point out Schrödinger's cat. A rather cute little moggy, but an
extremely important one. It is physically impossible to measure system polling speed in software,
and pretty difficult to measure it in hardware. You see, the very act of performing your
measurement will affect the results. And you cannot easily 'account' for the time taken to make
your measurements because measuring yourself is subject to the same artefacts as when measuring
other things. You can only say 'to hell with it', and have your program report your polling rate
as being 379 polls/sec, knowing that your measuring code may be eating around 20% of the
available time, and use the figures in a relative form rather than trying to state "My
computer achieves 379 polls every second". While there is no untruth in that, your computer
might do 450 if you weren't so busy watching! You simply can't be JAFO.
...and you need to go to school/college and get bored rigid to find out what relevance any of
this has to your cat. Mine is sitting on my monitor, asleep, blissfully unaware of all these
heavy scientific concepts. She's probably got the right idea...
I don't wish to get into an advocacy war here. My personal preference is co-operative, however I
don't feel that either is the answer; rather, a hybrid of the two could make for a clean system.
The major drawback of CMT is that if an application misbehaves and goes into a never-ending loop,
control never comes back, and the application has to be forcibly killed off.
Niall Douglas wrote a pre-emption system for
RISC OS applications. Surprisingly, you didn't really notice anything much until an application
entered some heavy processing (say, ChangeFSI) at which point life carried right on as normal
while the task which would have stalled the machine for a while chugged away in the background.