Rick's b.log - entry 2020/06/01

mailto: blog -at- heyrick -dot- eu

EBook for RISC OS - source release

Let's get one thing out of the way right now:

I am not going to work on this project again.
Ever.

A little bit of history. Regular readers of my blog will know that I got myself an ebook reader in the sales a few years ago. Actually, I got two. The first was a PocketBook Basic 2. A couple of weeks later, I got myself a PocketBook Aqua, which was pretty much the same thing but supposedly waterproof (not tested - it doesn't look more than splashproof to be honest) and a touch display instead of buttons. The older eReader I gave to mom, and had to set up Calibre on the PC to translate all the Kindle books that I bought her into something that works on her reader.

And so, I discovered various sites offering epub format books for enjoyable reading - Project Gutenberg, Smashwords, etc.

One day, I looked at an epub file in Zap. It looked... really familiar. So familiar, in fact, that I dropped it onto SparkFS at which point a directory of files appeared. Yup, an epub is basically a zip file.
Ebooks themselves are written in a very restricted form of XHTML and CSS. Restricted in that all the wonderful things that one can do in web browsers are often simply not supported, and restricted in that there is zero fault tolerance. My first attempt at marking up an XHTML file for testing (described in more detail) failed - the PocketBook reader discarded half of my content over a <br> tag that should have been <br />. Yup, one missing slash and the renderer simply gave up.
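
As a rough illustration of that strictness, here is a minimal sketch in Python using its standard XML parser. This is an assumption of convenience: it is not what the PocketBook renderer runs, and the helper name is_well_formed is made up for the example. It only shows that a strict XML parser treats the missing slash as a fatal error rather than something to recover from.

```python
import xml.etree.ElementTree as ET

def is_well_formed(fragment: str) -> bool:
    """Return True if the fragment parses as strict XML, False otherwise."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False

# Hypothetical test fragments, not taken from any real book.
good = "<p>First line<br />Second line</p>"
bad = "<p>First line<br>Second line</p>"

print(is_well_formed(good))  # self-closed <br /> is valid XML
print(is_well_formed(bad))   # unclosed <br> leaves </p> mismatched
```

A forgiving HTML parser would quietly close the tag; an XML parser has no such obligation, which is presumably why the reader gave up.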

This got me to thinking, what would be required to be able to view such books on RISC OS? Obviously a smaller custom HTML parser would suffice, as it doesn't have to cater for all the edge cases that real browsers need to handle. Plus, it is permitted to give up upon hitting an error (although, really... there's such a thing as a recoverable error).

The first part of this phase was to figure out the underlying format of an epub file and to work out how to glue all the pieces together (in the right order) to create a representation of the book's content. And then to simply strip out markup in order to have something that can be seen on the screen.
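
For anybody wanting to try the same thing, the gluing-together works roughly like this: META-INF/container.xml inside the archive names the package (OPF) file, the package's manifest maps ids to files, and its spine lists those ids in reading order. The following is a minimal sketch in Python, not the C code from EBook. The function names and the tiny in-memory demo book are hypothetical, and the demo omits metadata that a real epub would require.

```python
import io
import posixpath
import zipfile
import xml.etree.ElementTree as ET

CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}
OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}

def reading_order(epub) -> list[str]:
    """Return archive paths of the content files in spine (reading) order."""
    with zipfile.ZipFile(epub) as zf:
        # Step 1: container.xml says where the package document lives.
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        opf_path = container.find(".//c:rootfile", CONTAINER_NS).get("full-path")
        package = ET.fromstring(zf.read(opf_path))
        # Step 2: the manifest maps item ids to file names.
        manifest = {item.get("id"): item.get("href")
                    for item in package.findall("opf:manifest/opf:item", OPF_NS)}
        # Step 3: the spine lists ids in reading order; hrefs are
        # relative to the directory holding the package document.
        base = posixpath.dirname(opf_path)
        return [posixpath.join(base, manifest[ref.get("idref")])
                for ref in package.findall("opf:spine/opf:itemref", OPF_NS)]

def make_demo_epub() -> io.BytesIO:
    """Build a tiny two-chapter archive in memory (hypothetical content)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("mimetype", "application/epub+zip")
        zf.writestr(
            "META-INF/container.xml",
            '<container version="1.0" '
            'xmlns="urn:oasis:names:tc:opendocument:xmlns:container">'
            '<rootfiles><rootfile full-path="OEBPS/content.opf" '
            'media-type="application/oebps-package+xml"/></rootfiles>'
            '</container>')
        # The manifest is deliberately listed out of order: only the
        # spine decides the reading order.
        zf.writestr(
            "OEBPS/content.opf",
            '<package xmlns="http://www.idpf.org/2007/opf" version="2.0">'
            '<manifest>'
            '<item id="c2" href="two.xhtml" media-type="application/xhtml+xml"/>'
            '<item id="c1" href="one.xhtml" media-type="application/xhtml+xml"/>'
            '</manifest>'
            '<spine><itemref idref="c1"/><itemref idref="c2"/></spine>'
            '</package>')
        zf.writestr("OEBPS/one.xhtml", "<html/>")
        zf.writestr("OEBPS/two.xhtml", "<html/>")
    buf.seek(0)
    return buf

print(reading_order(make_demo_epub()))
```

Reading each file in that order and stripping the tags gives the plain text of the book.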

That is basically what these early versions of EBook do. The initial version stripped out everything, but since it looked rubbish I threw in some really simple interpretation of basic HTML sequences for rudimentary formatting - bold, italics, linebreaks.
Along the way I discovered that while the XHTML might be rigid, there are numerous quirks regarding whitespace, and also how accented characters are handled. Epub is predominantly UTF-8, but some books use the &#xxx; style of markup. I haven't yet come across named glyphs (such as &eacute;), but they probably exist too.
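
To show what normalising those character forms might look like, here is a small sketch in Python. It leans on the standard library's html.unescape, which is considerably more forgiving than a strict XHTML reader needs to be, and the function name to_utf8 is invented for the example. It is not how EBook does it.

```python
import html

def to_utf8(text: str) -> bytes:
    """Resolve numeric and named character references, then encode as UTF-8."""
    return html.unescape(text).encode("utf-8")

# Decimal, hexadecimal and named forms of the same character (e-acute).
for sample in ("caf&#233;", "caf&#xE9;", "caf&eacute;"):
    print(sample, "->", to_utf8(sample))
```

All three spellings end up as the same two-byte UTF-8 sequence, so anything later in the chain only has to deal with UTF-8.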

And that's as far as I got before abandoning the program.

Why? Well, it was written in something of a frenzy in September 2019, first version released on the 14th, and final version released on the 18th. Regular visitors may recall that September was not a good month. My mother and I spent a lot of time lying to each other about how bad things really were. I released version 0.04 on the 18th. She died eleven days later.
In hindsight, I probably should have sat with her more, talked to her more, something more. I don't know. We both pretended like everything was not so different to normal, so I don't know if she would have wanted to talk. It's why it was chaos after her death, I never asked her where all the "important papers" were, stuff like that. Because she wouldn't have wanted to go there. Talk like that, talk of the end, it was like accepting fate and giving up. And mom was not one for giving up.

Suffice to say, as much as I wanted to have a crack at bringing some lightweight support for ebooks to RISC OS, the timing was completely wrong. This project contains a lot of unpleasant memories, not least the fact that I simply shouldn't have bothered. I should have been strong enough to see the elephant in the room and simply have been with her more.
Because of this, I don't ever plan to work on this project again.

Because of that, I'm releasing the sources. They aren't good, it was basically making it up as I was going along to get something on the screen. However, it does work, if only in a very simple sense. It might serve as a basis for somebody else to tear out my rubbish keyword scanning and drop in a better parser in order to have books that look and feel more like, well, books?

There is another epub reader for RISC OS. MuView, a port from the Linux world. However in common with many ports, it has numerous "issues". Firstly, it is huge. Secondly, it requires quite a lot of resources. Thirdly, on my system it rendered completely empty pages. I'm guessing I am missing fonts or something, but it gave me blank pages instead of telling me something was wrong. Fourthly, it was really slow. It took many seconds per page (even when outputting nothing!). And finally, it apparently crashes on earlier OS builds. The help file mentions this, but for some reason the program startup doesn't bother to check the version of UtilityModule to warn about crashiness. No, it just crashes.

Back in the Acorn days, ANT wrote a browser called Fresco. I use that as my "rough guideline". Anything for RISC OS that deals with ebooks doesn't really have justification to be larger/slower than Fresco. It's not doing anything that much different to a simple-ish web browser, and indeed while the markup may be more involved (there's CSS now), as I said earlier it's a rigid to-the-spec implementation and there are no weird protocols and fetching necessary. Simply loading stuff from a zip file and interpreting it. So, yeah, I feel that a comparison with something along the lines of Fresco is valid.
Some time may be taken when opening files (working out what's what, what's needed, and paginating) but from that point on, I don't really see why it wouldn't be possible to simply point at the data and throw it to the screen. Page changes shouldn't take more than a hiccup...

The code is split into five modules:

wrapper
This is the module that contains main(). It starts up the program, sets up the Wimp environment, runs the main polling loop, frees memory on exit, and also handles the debugging output (via DADebug).
As its name implies, this is the "wrapper" around the program. It is where we begin.
splitter
The first part of loading an epub book is by passing it through the splitter. This opens the archive and reads both the manifest and the spine in order to work out what the book is, and what files comprise the book's content.
Once this has been done, the content is extracted piece by piece from all of the parts and is then written out, currently, to a "marked up" text file called "ebookfile" within Scrap.
Marking up means that, while the file is nominally 78 characters wide, it may contain the following:
- UTF-8 (anything that isn't plain ASCII should be expressed as UTF-8).
- [02][xx] to define a style. The style codes are 255 for normal (chosen in lieu of zero because that would mean a null byte in a C string), 1 for italic, 2 for bold, and 3 for bold italic.
There were more styles intended, for switching fonts and marking alternative languages (for supporting, say, Greek or Japanese by switching to the Cyberbit font), but I never got that far.
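
As an illustration of the format just described, here is a minimal Python sketch (not the C from the sources) that splits one marked-up line into styled runs. The names split_runs and STYLES are hypothetical, and two things are assumptions: that each line starts in the normal style, and that an unrecognised style code can simply be reported as unknown. Note that it works on raw bytes, because the style byte 255 is not valid UTF-8 by itself.

```python
STYLE_MARK = 0x02
STYLES = {255: "normal", 1: "italic", 2: "bold", 3: "bold italic"}

def split_runs(line: bytes) -> list[tuple[str, str]]:
    """Split one marked-up line into (style name, text) runs."""
    runs = []
    style = 255              # assumption: every line starts out normal
    chunk = bytearray()

    def flush():
        # Emit the text collected so far under the current style.
        if chunk:
            runs.append((STYLES.get(style, "unknown"), chunk.decode("utf-8")))
            chunk.clear()

    i = 0
    while i < len(line):
        if line[i] == STYLE_MARK and i + 1 < len(line):
            flush()
            style = line[i + 1]   # the byte after the marker is the style code
            i += 2
        else:
            chunk.append(line[i])
            i += 1
    flush()
    return runs

# Hypothetical line: "A " normal, "bold" in bold, " café" normal again.
demo = b"A " + bytes([2, 2]) + b"bold" + bytes([2, 255]) + " caf\u00e9".encode("utf-8")
print(split_runs(demo))
```

A redraw routine would then walk the runs, switching font before plotting each piece of text.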
book
This module simply loads in the marked up ebookfile and scans through it building up a list of pointers to where each line begins.
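
The idea can be sketched in a few lines of Python. In C these would be pointers into the loaded buffer; the byte offsets below play the same role. The function names are made up for the example, and it assumes lines end with a single linefeed.

```python
def index_lines(data: bytes) -> list[int]:
    """Return the byte offset at which each line of data begins."""
    starts = [0] if data else []
    pos = data.find(b"\n")
    while pos != -1:
        # A newline at the very end does not start another line.
        if pos + 1 < len(data):
            starts.append(pos + 1)
        pos = data.find(b"\n", pos + 1)
    return starts

def get_line(data: bytes, starts: list[int], n: int) -> bytes:
    """Return line n (zero-based) without its trailing linefeed."""
    end = starts[n + 1] if n + 1 < len(starts) else len(data)
    return data[starts[n]:end].rstrip(b"\n")

# Hypothetical three-line file.
data = b"one\ntwo\nthree\n"
starts = index_lines(data)
print(starts)
print(get_line(data, starts, 1))
```

With the index built once at load time, fetching any line (and therefore any screenful) is just a lookup, with no rescanning of the file.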
wimp
This handles the UI. Primarily it performs three tasks - get the UI running (load the resources, windows, etc), handle user interactions, and perform the redraw of the book viewer.
I was planning to split out the redraw code into its own module for v0.05, helps to keep the code tidier. I'd advise you to do the same, especially as and when your redrawing gets a little more complicated than simply writing out a line of text (with optional markup).
support
Some support routines. A substr function that I'm not sure I ever used (I think I intended to in a better parser), a memory dump to file (was useful in determining what was actually in certain arrays), and a function to dump the state of the splitter that was used while developing it to ensure that what was in the arrays matched what was expected to have been there.

The project builds with the DDE. It contains some inline assembly, so will need a recentish version of the compiler (I'd guess anything within the last two years or so would work). It also depends upon DeskLib; more specifically my extended version of DeskLib. If you are a DDE user, then I would recommend it. ☺

ebook004src.zip - EBook v0.04 source, 364KiB

dl230rm_20200511.zip - DeskLib32/RM version of 2020/05/11, 244KiB

The EBook archive comes complete with a prebuilt executable, and I've included an ebook called No Explanations (by Karen Reis) to get you started if you haven't used the program before.
If you want to compile it yourself, install DeskLib if you haven't already, extract the contents of the EBook archive to your harddisc, and then create an o directory within. Then drop the MakeFile onto AMU. It should build with no errors and no warnings.

EBook is licenced under the EUPL. I have included the text of the EUPL version 1.2 with the program.
If you haven't come across the EUPL before, think of it as like the GPL, only without all the restrictions and political bollocks. It is also notable for being a proper multilingual licence - it is available in all of the official languages of the EU and they all have the same legal weight.

A final thought - why is it such a mess trying to work out where each page starts? It looks like Google Docs outputs <hr style="page-break-before:always; [...]>, but not always. Some things just flow on without break markers. Others... have nothing.
Now, I don't expect to have absolute page numbering given that an ebook flows from one screenful to the next, however it does seem strange that a format intended for the display of books did not include some way of referencing the page (as related to the print version).
My PocketBook reader shows pages, but it is odd. I may read page 127 (of 318) for three screenfuls before it increments to 128. I guess perhaps it is running a fake render against the page size of a normal book, or something? I don't know. It seems like a percentage would be better, but given that you're usually asked for a page number when jumping around in the book, it might have been useful for this to have been reliably embedded into the content?
PS: Don't get me started on the rubbish that Google Docs outputs when asked to export as epub. On a system smart enough to ignore the broken text sizing, it's all run together. On a system that obeys what it is told (like my PocketBook reader), things are laid out better (although a 97 page book is showing as 58 pages...?), some things are run together while others are not (even when the original has hard pagebreaks). But, more importantly, the text size is fixed and cannot be changed by pinching in or out (thus, fundamentally, breaking how an ereader works) making it look like a PDF...with about fifty lines down an 800 pixel screen. It's just... badly broken. Even forcing my own font and size doesn't work (it's using AdobeViewer).

This web page is licenced for your personal, private, non-commercial use only. No automated processing by advertising systems is permitted.
RIPA notice: No consent is given for interception of page transmission.


Retrieved from https://heyrick.eu/blog/index.php?diary=20200601 on 21st November 2024