Rick's b.log - entry 2019/06/30

mailto: blog -at- heyrick -dot- eu

Raspberry Pi 4

So the Pi 4 has arrived. Once upon a time, your thirty five American dollars would have bought you a 256MiB ARM11 device. Now? That same $35 would buy you a Pi4, the sixth incarnation of the Pi family (if you count the Zero and Compute module). Astonishingly, Farnell UK offers a price that is very nearly a straight conversion of $35. However, more predictably, the official French retailer (Kubii) loads about €7 into the price, which is a lot when it's supposed to be €31ish - that's like a third extra. The 4GiB model is about €10 over what a straight $ conversion would be.

Anyway, the Pi 4... It is now available in three models, differing in how much memory is onboard - with 1GiB onboard, with 2GiB onboard, and with 4GiB onboard. The price is around $10/€10 for each increment. For some reason, the Pi Foundation figured that the best model for everyone would be the 2GiB model. And with complete and utter predictability, the 4GiB ones have sold out practically everywhere. And, also with utter predictability, the 4GiB models are on eBay with a price in the ballpark of eighty quid. It's basically a replay of the Pi Zero debacle, isn't it?

Apparently the little handbook that comes with the Pi makes reference to an 8GiB model. That may come, but there are issues dealing with that much memory on a 32 bit setup, as I believe Raspbian still is.

Now while the Pi may or may not be "the best ARM SBC" (it's highly subjective), one thing is for certain - they keep innovating and they keep battering down the competition. The original Pi basically broke the Beagleboard - okay, it was a lower spec for sure, but until then it was practically impossible to get an ARM board for less than the cost of a petrol mower or x86 netbook. This wasn't a sub-$100 board, this was a sub-$50 board.
The original model B (~256MiB, ARM11, 700MHz) was followed by a cut-down model A, and a slightly superior model B+. The naming is practically a blatant nod to the BBC Micro (that tried to bring computing to the masses in the '80s).

The next machine to be released was the Pi 2. This was basically the same board and form factor as the B+, but its single ARM11 core had been replaced by a quad core ARMv7 (Cortex-A7) running at 900MHz, and memory standardised at 1GiB. The price stayed the same, so your nominal $35 got you a quad core machine.

This was followed by the Pi Zero that, basically, broke everything. Having some I/O, an HDMI output, a single USB port, and micro USB for power, that was about it. However with some extras (like a powered USB hub) it was perfectly capable of acting as a media server. I've had OSMC running on mine. The UI is a little laggy, but playback is good. I tried one of those Google media dongles the other day, and its UI was even slower! But why did the Pi Zero break everything? Well, because it cost a fiver. Yup, a fiver. A shade more than a Happy Meal for a fully functional Pi-Lite.

Later versions of the Pi2 used the same processor core introduced with the Pi 3 - a 1.2GHz ARMv8 quad Cortex-A53 (underclocked on the Pi2). The Pi 3 also added onboard Bluetooth and WiFi. So your $35 now got you a 64 bit capable machine with built-in connectivity.

It was hard to see where the Pi family could go from here. The 3+ jacked up the speed a little (1.4GHz) but there was actually a lot wrong with the Pi design. Primarily, pretty much everything hung off of one USB port on the SoC. The 3+ offered gigabit ethernet, but there wasn't much point as it was bandwidth shared with the USB peripherals so while it was better than 100Mbit Ethernet, it sort of maxed out at around 230Mbit. Secondly, the GPU was good and flexible (and way better than the one in the Beagle xM!), but it really only did up to FullHD (officially) and could be coaxed up to 4K(ish) if your monitor would sync to absurdly low refresh rates. Something had to change, and something more than wiring in bigger and better processing cores.

What was necessary was also an unexpected surprise: something that looks on the outside a lot like "yet another Pi", but on the inside is massively different.

Image from https://www.raspberrypi.org/

The processor is a quad Cortex-A72 clocked at 1.5GHz. This will offer some speed increases due to architectural changes and enhancements. In a Linpack FP benchmark, it blows the rest of the Pi family away, being some 3-4 times faster. In a more real-life test (GIMP photo edit benchmark), it's maybe 20% faster than a Pi 3 B+, easily twice as fast as a Pi2, and don't even think about comparisons with the older single core models! The Speedometer 2 browser benchmark shows the Pi 4 to be about twice as fast as the fastest earlier Pi, the Pi 3 B+.
The GPU is now a VideoCore 6 (or "VI"), with dual HDMI output, currently capable of FullHD on both outputs, but will in time offer 4K at 60Hz on one output, or 4K at 30Hz on both. Or any mixture that fits the available bandwidth (4K/1080p, for instance). VideoCore's internals are still closed source, mind you.
USB is now provided via an internal PCIe bus. It's a USB3 chip, but due to technical constraints, two ports are USB3 and two ports are wired as USB2. USB performance had remained pretty static across all of the earlier devices, managing about 40MiB/second flat out, primarily because it was all hanging off of one internal USB port. Not so the Pi 4, which has its own PCIe controller. It benchmarked around 320MiB/sec (SSD write) and 460MiB/sec (SSD read). In other words, a phenomenal difference.
Ethernet is no longer passing via USB, so gigabit networking is closer to what it ought to be. I think it was benchmarked a little over 900Mbit, which is pretty good. Certainly it's now possible to transfer data to/from a USB harddisc and network without the annoying bottlenecks that the older models suffered from. The SoC itself now has around 5Gb/sec external bandwidth capability.
As expected now, dual band WiFi and Bluetooth onboard.
Because of the increased power requirements and the plug-in peripherals, the daft micro USB connector has been replaced with a USB C socket. Speaking of sockets, the HDMI sockets are tiny, like on the Pi Zero, and not the big chunky ones.
Memory is also quite a bit faster than in previous models. This will have obvious consequences on the feel of the system in real life tasks - memory throughput is 3-4 times that of a Pi2, and about twice that of a Pi 3 - which when coupled with 4GiB onboard should make it a perfectly acceptable machine for many desktop tasks; it has been reported that using the web version of Google Docs on a Pi4 is not that much different to using Google Docs on a modern laptop costing twenty times the price - as was noted earlier with the significant speed boost in the browser benchmark.
Something that might also be worth noting is the GPIO performance. A small bit of code toggling a GPIO pin on and off measures around 8kHz on a Pi 2, 14-16kHz on a Pi 3, and a rather astonishing >50kHz on a Pi 4.

That's not to say it's perfect. Looking around online, pretty much everybody seems to be agreeing, it runs hot. I have set my Pi2 to throttle itself back if the core temperature passes 55°C. Well, the Pi 4 will blow past that doing nothing at all. A journo writing a review in LibreOffice on a Pi 4 noted that it nearly touched 80°C just doing that. Heat dissipation is centred on the SoC, obviously, but also on the power controller (just by the USB C socket).
More powerful hardware that generates more heat means only one thing. That you'll need active cooling. Not just a passive heatsink, but a heatsink and a fan.
So it stands to reason that it's a more power hungry device. However much you think the Pi 3 wanted? Well, the Pi 4 wants about twice that.
And, of course, the headline 4K video support doesn't work. Yet.
Also worth noting is that the hardware support for 4K content is only for H.265 (HEVC); hardware H.264 decoding is limited to FullHD, and hardware support for earlier codecs (MPEG2 and H.263 (XviD era)) has been dropped entirely as it is considered that the processor is powerful enough to decode those in software.

The Pi Foundation have done it again. Your $35 (or ~$55 for 4GiB) buys you a device that isn't that unlike a flagship smartphone of a few years ago. And while the Pi isn't the most powerful ARM board (NanoPC?) or the cheapest useful one (Rock64 media board?), it is a solid offering that won't break the bank and has a massive amount of support. RISC OS is available for the Pi1-3, and the Pi4 version is being worked on. There's no official version for the many clones (BananaPi, OrangePi, etc).

To give an example of what I mean, the Beagleboard xM might be a much more open design (the OMAP3 technical data is a thing of beauty that all datasheets should aspire to match), but the operating systems that I know about are a broken Angstrom Linux image, RISC OS, and maybe a version of Debian if you can follow some complicated instructions.
The Pi, on the other hand, has numerous (mostly Linux-based) operating systems and some out of the box "does stuff" like Kodi (in which the underlying OS isn't actually that important), plus various unofficial sort-of operating systems like Plan9 (if you can find it), and Android (non-free via emteria.OS). And, of course, RISC OS.

Why I won't be getting one....yet

It's not that RISC OS doesn't currently work on it. It's that the machine I'm using keeps up with typing, keeps up with DTP work, is running the webserver in the background, is light on energy use, and has 959,172KiB memory free.
That's about 937MiB of my 1GiB machine, unused.

Now, the Pi 4 may be an interesting thing to use to play with some sort of Linux - 4GiB and dual screen probably trounces a lot of lower end laptops - and those sorts of capabilities ought to make a perfectly usable (if not blazing) system to replace a large noisy power-hungry PC. It's a setup sweeter than a fair few Android tablets, and at a lower cost. The Pi 4 could genuinely replace the Big Beige Box for a lot of users who don't do complicated things and aren't specifically tied to one ecology.

But for RISC OS?
No way.
What the hell do I need all of that for? Instead of 937MiB unused, it'd be about 4009MiB (or 4,104,900KiB) unused.
RISC OS isn't demanding. It doesn't do demanding things. My ARMv7 Pi2 is completely adequate.

Another ARM board

Just "for the hell of it", I picked up two of these from Amazon. Costing a few euros including shipping from China, they arrived sooner than expected:

It is a (widely cloned) STM32F103 device, known as "Blue Pill". Specifically it is a Cortex-M3 core running at 72MHz. Officially 64KiB Flash (though a lot of sources suggest there's actually 128KiB), and 20KiB RAM. Like the ESP32 there's a lot of flexible I/O. Unlike the ESP32, no WiFi. 3 timers, 2 SPI (36MHz for SPI1 (overspec but works), 18MHz for SPI2), 2 IIC, 3 USART, 1 USB, 1 CAN, 37 GPIO (but this depends on what the board has actually brought out!), 2 ADC (10 channel).
The M3 core is an ARMv7 class Thumb-1/Thumb-2 with 32 bit hardware multiply and divide and support for saturating arithmetic, NMI, and a 12 cycle interrupt latency.

That said, the device isn't without complications. The USB port is useless out of the box; apparently one must burn in a bootloader, otherwise it's a messy serial protocol. Usefully, it looks as if David Pilling has already been here, so there are some things to read.

Why? I dunno. I must be masochistic wanting to inflict that Arduino IDE on myself again... Thing is, for the moment this is a processor looking for a solution. I don't dabble much in the embedded world...

Code efficiency

In my last article, Gavin pointed out that Norcroft and GCC have different ways of handling jump tables.

Here is a simple example taken from a random location in NetSurf:

000D5DC8 : ..Sã : E3530004 : CMP     R3,#4              ; Compare R3 with 4
000D5DCC : .ñﬂ– : 979FF103 : LDRLS   PC,[PC,R3,LSL #2]  ; If lower or same, load PC from PC + (R3 << 2) [+ 8]
000D5DD0 : 1..ê : EA000031 : B       &000D5E9C          ; If higher, do this
000D5DD4 : ø^.. : 000D5EF8 : Undefined instruction      ; Address if R3 is zero
000D5DD8 : è].. : 000D5DE8 : ANDEQ   R5,R13,R8,ROR #27  ; Address if R3 is one
000D5DDC : 8_.. : 000D5F38 : ANDEQ   R5,R13,R8,LSR PC   ; Address if R3 is two
000D5DE0 : ¤^.. : 000D5EA4 : ANDEQ   R5,R13,R4,LSR #29  ; Address if R3 is three

How this works is to compare R3 with 4. If it is lower or the same (LS is an unsigned "lower or same" test), then load a word of data into PC from the address PC + (R3 << 2). The shift by two is because each address in the table is one word long, and a word is 32 bits or four bytes. Thus, the shift by two is the same as multiplying by four. Consequently, this changes R3 to reference sequential word addresses - making 0, 1, 2, and 3 become 0, 4, 8, and 12 respectively. Added to PC, this points at each word in the jump table.
The branch afterwards is the "otherwise" behaviour.
This is followed by the jump table, one word for each value of R3. The listing shows the first four, giving the addresses for if R3 is 0, 1, 2, or 3; as LS also passes when R3 is exactly 4, a fifth word follows these.

You might be wondering how it is possible to skip instructions. This is effectively what is happening if R3 is zero. You're loading the word from the address of PC + 0 and it's the first item of the jump table after the branch instruction.
The reason this works is a quirk of the ARM processor, in that the value of PC is always two instructions (or two words) ahead of the instruction currently being executed. Think of it as a very literal interpretation of the Fetch - Decode - Execute cycle, where PC follows the Fetch and the Execute lags two instructions behind.
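To make the arithmetic concrete, here is a rough model of that lookup. It is Python rather than ARM code, purely as an illustration, and the names (table_slot, dispatch, MEMORY) are my own invention rather than anything from NetSurf or GCC. It only knows about the four table words shown in the listing.

```python
# Rough model of: CMP R3,#4 / LDRLS PC,[PC,R3,LSL #2] / B otherwise
# All names here are illustrative assumptions, not real NetSurf symbols.

LDRLS_ADDR = 0x000D5DCC   # address of the LDRLS instruction
OTHERWISE = 0x000D5E9C    # where the B after it goes

# The words that follow the branch (only those shown in the listing).
MEMORY = {
    0x000D5DD4: 0x000D5EF8,
    0x000D5DD8: 0x000D5DE8,
    0x000D5DDC: 0x000D5F38,
    0x000D5DE0: 0x000D5EA4,
}


def table_slot(index):
    """Address the LDRLS reads from; PC reads as the instruction + 8."""
    pc = LDRLS_ADDR + 8
    return pc + (index << 2)


def dispatch(index, limit=4):
    """Return where execution continues for a given value of R3.

    Returns None for a slot that the listing does not show.
    """
    index &= 0xFFFFFFFF           # the comparison is unsigned
    if index > limit:             # LS fails, so fall through to the B
        return OTHERWISE
    return MEMORY.get(table_slot(index))


for r3 in (0, 1, 2, 3, 9):
    print(f"R3={r3}: continue at &{dispatch(r3):08X}")
```

With R3 as zero, table_slot() gives &000D5DD4, which is the word right after the branch. That is the "skipping instructions" effect described above.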

This is fairly efficient as you're doing a single direct branch by loading an address straight into PC, and it is capable of accessing the entire (4GiB) addressing range of the machine as the data is a complete 32 bit word. It is, however, not position independent. So it either needs to be fixed up somehow at application start, or the linker needs to ensure that the code never ever moves in location - this is usually acceptable for applications that are always based at &8000, but useless for Relocatable Modules (clue in the name!).

The Norcroft method looks like this:

00000FD4 : ..Rã : E3520014 : CMP     R2,#&14            ; Compare R2 with 20
00000FD8 : .ñ•. : 908FF102 : ADDLS   PC,PC,R2,LSL #2    ; If lower or same, PC = PC + (R2 << 2) [+ 8]
00000FDC : ...ê : EA00001C : B       &00001054          ; If higher, go here
00000FE0 : ...ê : EA00001D : B       &0000105C          ; Branch for if R2 is zero
00000FE4 : 0..ê : EA000030 : B       &000010AC          ; Branch for if R2 is one
00000FE8 : 2..ê : EA000032 : B       &000010B8          ; Branch for if R2 is two
[etc]

How this works is to compare R2 with 20. If lower or the same, then calculate PC to be PC + (R2 << 2). This will cause execution to jump to the relevant location, which contains a branch instruction to the desired target address. Because a branch instruction is used, the address has to fit within the instruction. It's a 24 bit value, but since the lower two bits are zero (word alignment), it is shifted two places to the left to be a 26 bit address. Which means it can address +/- 32MiB.

As before, PC is always two instructions on from the address currently being executed. So the ADDLS, if R2 is zero, takes the current address + 2 words, adds nothing to it, and writes it back to PC causing a branch to the fourth instruction down (which is the ADDLS plus two). Thus, the first branch is the "otherwise" clause. Another thing to note is that a branch instruction does not encode an address. It encodes a signed number of words from the value of PC at the point of the instruction (not the address of the instruction!). So if we take the last of them, it's &FE8 + 8 + (&32 * 4) which is &10B8, the target of the branch.
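The same sums can be checked with a few lines of Python. This is only a sketch of the decoding described above, using the instruction words from the listing; the function names are made up for the example.

```python
def branch_target(addr, word):
    """Work out where an ARM B instruction at 'addr' goes."""
    offset = word & 0x00FFFFFF        # low 24 bits: signed count of words
    if offset & 0x00800000:           # sign-extend from 24 bits
        offset -= 0x01000000
    # PC reads as the instruction address + 8; the offset is in words.
    return (addr + 8 + (offset << 2)) & 0xFFFFFFFF


def addls_target(addr, index):
    """Where ADDLS PC,PC,Rn,LSL #2 at 'addr' lands for a given index."""
    return addr + 8 + (index << 2)


# Instruction words from the listing above, keyed by address.
CODE = {
    0xFDC: 0xEA00001C,
    0xFE0: 0xEA00001D,
    0xFE4: 0xEA000030,
    0xFE8: 0xEA000032,
}

for r2 in range(3):
    slot = addls_target(0xFD8, r2)
    print(f"R2={r2}: via &{slot:X} to &{branch_target(slot, CODE[slot]):X}")
```

For R2 of 0, 1, and 2 this goes via &FE0, &FE4, and &FE8 to &105C, &10AC, and &10B8, matching the disassembly.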

This is a less efficient operation as one branches (in the ADDLS) and then branches again, however it is entirely position independent. The reason the address in the left column is so low is because this code is taken from one of my modules. As modules are loaded into the RMA in "the next available place", fixed addresses cannot be used. It must be a jump table constructed in this manner.
Modules written in assembler usually use some sort of variation on this.
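As a small illustration of why the branch table can be moved and the table of addresses cannot, here is a Python sketch. It pretends the code has been copied somewhere else with no fix-ups and looks at how far each destination moves with it. The move_code() helper and the distance used are purely hypothetical; the instruction and table words are the ones from the two listings.

```python
def b_target(addr, word):
    """Destination of an ARM B instruction at 'addr'."""
    offset = word & 0x00FFFFFF        # signed 24 bit count of words
    if offset & 0x00800000:
        offset -= 0x01000000
    return addr + 8 + (offset << 2)


def move_code(delta):
    """Copy the code 'delta' bytes away, unchanged, and report how far
    each style's destination moved along with it."""
    # Branch style: the same B word, now sitting at a new address.
    before = b_target(0xFE8, 0xEA000032)
    after = b_target(0xFE8 + delta, 0xEA000032)
    branch_shift = after - before

    # Table style: the word is an absolute address, so an unchanged
    # copy still points at the old location.
    table_before = 0x000D5EF8
    table_after = table_before
    table_shift = table_after - table_before

    return branch_shift, table_shift


# &20000 is an arbitrary distance chosen for the example.
print(move_code(0x20000))
```

The branch destination moves by exactly the distance the code moved, while the table entry does not move at all and so points at where the code used to be.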

Should the Norcroft compiler be able to switch itself to use the GCC style jump table in application code? Perhaps, but I'm not sure it matters much on a multitasking system with many interrupts running in the background. Jump tables only seem to be used if there are five or more sequential entries in a switch() statement.

Norcroft compiler CPU/architecture types

The following lists are the CPUs and architectures that the Norcroft compiler (v5.77-5.78) recognises. I've put them together because there's no clear table anywhere. Unlike objasm, the compiler won't output a list of known types. So I did it the long winded way - Zap and the executable. ☺

The current default is to schedule instructions for best behaviour on an XScale processor, while generating code that will run on an ARMv2 processor.
For targeting modern machines (at the expense of older ones), perhaps -cpu cortex-a8 -arch 5 (if Iyonix, or -arch 6 for the ARM SBCs) might be better? I've not examined any code to see what differences this makes in practice.

The various synonyms are provided for compatibility with the processor types recognised by objasm. They are not set in stone - notice all the Cortex types are synonyms for an A8, and ARMv8-a is a synonym for ARMv7 - I can imagine these will change in the future.

The -fpu option is undocumented (but encouraging to see). You should not use this as all of CLib expects to deal with FPA values (the data is arranged differently between FPA and VFP so they aren't directly compatible).

-cpu (optimise the code for this processor)

arm2, arm3, arm6, arm600, arm610, arm7, arm7m, arm7dm, arm7tm, arm7tdmi, arm700, arm704, arm710, strongarm1, sa-110, sa-1100, sa-1110, arm9tdmi, arm910t, arm920t, arm922t, arm9e-s, arm9ej-s, arm926ej-s, xscale (default), arm1176jzf-s, cortex-a5 (synonym for A8), cortex-a7 (synonym for A8), cortex-a8, cortex-a9 (synonym for A8), cortex-a15 (synonym for A8).

-arch (generate code compatible with this processor)

2 (default), 2a (synonym for 2), 3, 3g (synonym for 3), 3m, 4xm, 4, 4txm, 4t, 5xm, 5, 5txm, 5t, 5texp, 5te, 5tewmmx (synonym for 5tej), 5tewmmx2 (synonym for 5tej), 5tej, 6, 6k, 6t2, 6z (synonym for 6k), 7, 7-a (synonym for 7), 7-r (synonym for 7), 7-a.security (synonym for 7), 8-a (synonym for 7).

-fpu (undocumented - do not use)

none, fp (default), vfp, vfpv3, vfpv3_d16.

Your comments:

Athiest, 1st July 2019, 14:22

You got no idea what you talking about jerk. Go to hell.

Gavin Wraith, 1st July 2019, 22:18

At the expense of using another register, and one more instruction, the jump table type you show from NetSurf can be modified to be position independent:
CMP R3,#max
LDRLS R4,[PC,R3,LSL #2]
ADDLS PC,PC,R4
B otherwise
followed by a table of offsets. Note that no undefined entry is needed now.

Gavin Wraith, 1st July 2019, 22:30

Actually, you could also reuse R3 in place of R4.

I am looking forward to getting an Rpi4 when supplies have settled down. The Rpi3B+ I am using for Raspbian, and the Rpi3 I am using for RISC OS do their jobs very satisfactorily.

Rick, 2nd July 2019, 00:15

Gavin, the "undefined" entry is the first branch address, it just didn't disassemble into anything.

Yes, that looks like an interesting way to make a position independent jump table. There's a part of me that suspects that the double branch (rather than read and jump) might have been faster on something like the ARM 2, and it's just stuck. ;-)

If you get a Pi 4, it looks like pretty much everybody says it runs really hot. You'll need a heatsink, and maybe a fan, as RISC OS is pretty simplistic in how it handles processor speed selection.

Rick, 2nd July 2019, 00:22

"Athiest" -

1, unless you have beef with the God Of Ofla (don't poke Cthulhu, he WILL crush you), I think you're replying to the wrong thing here.

2, Replace "got" with "have", add an "are" after the second "you", and a comma before "jerk".

3, a person claiming to not believe in God finishing their lovingly crafted insult with "go to hell" beggars belief.

4, if you're going to call yourself an atheist, learn to spell it correctly.

<mic drop>

Gavin Wraith, 2nd July 2019, 09:53

Both of us have seen before, on another forum, the sad spectacle of random misspelled pointless invective. A cry for help from a lonely lackwit, or computer-generated by a machine that simulates hate? In either case there is only one reaction possible.

Jeff Doggett, 2nd July 2019, 16:20

"LS" stands for Lower or Same
"LO" = Lower

Gavin, I cannot see how you've allowed for the extra instruction in the "LDRLS R4,[PC,R3,LSL #2]" code. Isn't there an off by one error now?

Rick, 2nd July 2019, 17:07

Well spotted Jeff. Yes, in order for that to work the input would need to begin at one, not zero.

David Pilling, 2nd July 2019, 22:36

Pi4 - I've seen it said they should have used eMMC memory rather than rely on SD card. Apparently the dual HDMI ports are good for one of their key markets - advertising displays. It is said they finished it months early, so it is out without full software support.
Blue Pill - OK, more computing power than typical ATMega 8 bit processors. But it's always what you want to do. If you're happy tinkering to get tiny power use, then an 8 bit processor is good. The Blue Pill has good, fast ADCs built in - couple of MHz, that's why it is found in those cheap o'scopes like you built here some time back.
3D print world/CNC has been built around 8 bit Arduinos and clever coding, things are moving on, going to ARM 32 bit based boards. Point being they use the Atom editor and platformio for build.
The ESP was a nightmare until people made boards that did all the boot loader stuff.

David Pilling, 2nd July 2019, 22:44

Do Atheists believe in Hell.

athy
adjective -- prone to examining beliefs critically and skeptically (neologism based on the common misspelling of 'atheist' as 'athiest')
If you accept most beliefs unquestioningly, you're not very athy. If you do so with most beliefs, yet have a certain set of beliefs you refuse to examine critically, you're athier. If you apply critical examination to ALL beliefs, then you are athiest.

(Not to be confused with the noun 'atheist', which means 'somebody who does not believe in god(s)'.)

This web page is licenced for your personal, private, non-commercial use only. No automated processing by advertising systems is permitted.
RIPA notice: No consent is given for interception of page transmission.

Retrieved from https://heyrick.eu/blog/index.php?diary=20190630 on 21st November 2024