Data representation – squashing Morse code

I’ll say at the start that there’s no real reason to do this. The Raspberry Pi has far more memory than any little bare-metal program is ever likely to use. However, this just struck me as a bit of fun, and arguably instructive in thinking about data representations.

Last Time I wrote about Charles Moore, I linked to his colorforth.com web site. While poking around that site I found several things of interest, some of which I plan to blog about here. For now, though, what struck me was the way he chose to store word names compressed in single 32-bit values:

A word starts with 4 bits that indicate its color and function – text, number, etc. Then 28 bits of left-justified, Shannon-coded characters, averaging 5.2 bits each.

This seems a bit extreme for me, but it is very clever, and it led me to think about the wasteful way I represent the dots and dashes in my Morse code program.. In the original version, each dot or dash was represented by a whole byte of information, when in theory only a single bit is needed. Say a “dot” were represented by a 0, and dash by a 1, we could easily fit the whole symbol into a single byte.

For example, the letter ‘a‘ would be just 01

But there’s a problem here. A byte is a fixed size container which must contain eight bits. And we are only using two of them. if we pad out the rest of the byte with zeroes we get 01000000 for our letter ‘a‘. This seems reasonable, until we try and do some other letters. The letter ‘r‘ is dot, dash, dot. Recognizably different from the dot, dash of our ‘a‘, of course, until we try and put it into a byte. We get a worryingly similar 01000000, and guess what, the letter ‘l‘ is dot, dash, dot, dot, and that becomes the same code too.

If any of the symbols were to be eight glyphs in length, we would be stuck here, but luckily, we have a get-out. The maximum symbol length in standard Morse code is 5 glyphs. That gives us three bits to play with, conveniently enough to represent numbers from 0..7, We can now add a length into the byte, say at the start. so our ‘a‘ becomes 2(010) + 01000, ‘r’ would be 3(011) + 01000, and ‘l‘ would be ‘4(100) + 01000. Excellent, we now have a way of packing a whole Morse code symbol into a single byte. There is still one slight problem, though. The original solution allowed sending a space character as just a pause, with no dots or dashes. This can’t be done by the new method, but we can at least special-case it. Let’s say that any symbol of 0 length is sent as a pause.

This all seems reasonable so far, but let’s take a look at how we might emit the symbol.

In pseudocode it might be something like:

read the length
for bit i in 1..length
  if bit (3 + i) is 1
    show a dash
  else
   show a dot

All that bit twiddling seems a bit tricky. To get a single bit a common way is to mask out the rest of the bits using logical &, then divide (or shift) the value down to the bottom bit so it counts as a 1 or a 0. I reckon there might be a way that’s a bit (ha) more elegant..

If we reverse the order of the glyph bits, so the first one to be emitted is in the “ones” bit, we will be able to show it straight away by just ignoring the rest of the byte. To see the next glyph all we need to do is shift the whole byte one bit to the “right” (divide by two).

This is, of course, a bit trickier to describe in the data, but once it is in place, the data is much smaller. A rough calculation shows it is 26 letters + 10 digits, plus space, each taking two bytes for a total of 74 bytes. The original data took at least 208 bytes, and probably more like 356 allowing for those string pointers.

#define GLYPHS(n) ((n & 0b111) << 5)
#define COUNT(n) ((n >> 5) & 0b111)

struct Morse { char letter; uint8_t symbol; } code[] = {
    { 'a', GLYPHS(2) + 0b10 },
    { 'b', GLYPHS(4) + 0b0001 },
    { 'c', GLYPHS(4) + 0b0101 },
    { 'd', GLYPHS(3) + 0b001 },
    { 'e', GLYPHS(1) + 0b0 },
    { 'f', GLYPHS(4) + 0b00100 },
    { 'g', GLYPHS(3) + 0b011 },
    { 'h', GLYPHS(4) + 0b0000 },
    { 'i', GLYPHS(2) + 0b00 },
    { 'j', GLYPHS(4) + 0b1110 },
    { 'k', GLYPHS(3) + 0b101 },
    { 'l', GLYPHS(4) + 0b0010 },
    { 'm', GLYPHS(2) + 0b11 },
    { 'n', GLYPHS(2) + 0b01 },
    { 'o', GLYPHS(3) + 0b111 },
    { 'p', GLYPHS(4) + 0b00110 },
    { 'q', GLYPHS(4) + 0b1011 },
    { 'r', GLYPHS(4) + 0b0010 },
    { 's', GLYPHS(3) + 0b000 },
    { 't', GLYPHS(1) + 0b1 },
    { 'u', GLYPHS(3) + 0b100 },
    { 'v', GLYPHS(4) + 0b1000 },
    { 'w', GLYPHS(3) + 0b110 },
    { 'x', GLYPHS(4) + 0b1001 },
    { 'y', GLYPHS(4) + 0b1101 },
    { 'z', GLYPHS(4) + 0b0011 },
    { '0', GLYPHS(5) + 0b11111 },
    { '1', GLYPHS(5) + 0b11110 },
    { '2', GLYPHS(5) + 0b11100 },
    { '3', GLYPHS(5) + 0b11000 },
    { '4', GLYPHS(5) + 0b10000 },
    { '5', GLYPHS(5) + 0b00000 },
    { '6', GLYPHS(5) + 0b00001 },
    { '7', GLYPHS(5) + 0b00011 },
    { '8', GLYPHS(5) + 0b00111 },
    { '9', GLYPHS(5) + 0b01111 },
    { ' ', GLYPHS(0) }
};
static int nchars = sizeof(code) / sizeof (struct Morse);

The code to emit the glyphs is a tiny bit more involved than the original version, but arguable no larger:

void morse_glyph(uint8_t symbol) {
  if (symbol) {
    switch_on(dash_pause);
  } else {
    switch_on(dot_pause);
  }
  switch_off(gap_pause);
}

void morse_symbol(uint8_t symbol) {
  int len = COUNT(symbol);
  if (0 == len) {
    raspi_timer_wait(word_pause);
  } else {
    while (len > 0) {
      morse_glyph(symbol & 0b1);
      symbol >>= 1;
      --len;
    }
  }
  raspi_timer_wait(letter_pause);
}

void morse_char(char c) {
  for (int i = 0; i < nchars; ++i) {
    if (c == code[i].letter) {
      morse_symbol(code[i].symbol);
      return;
    }
  }
}

The morse_glyph() function is now given a single bit rather than a character, and all it does is emit a dash for 1 or a dot for 0. The morse_symbol() function has the most changes. It now extracts the length, checks for zero (pause) and if not it loops round count times, sending the bottom bit and then shifting the next bit down. The code for morse_char() is still exactly the same, but the type of the symbol field, and thus the type of the value passed to morse_symbol() has changed "underneath it".

I hope that was a mildly fun diversion.

Leave a Reply

Your email address will not be published. Required fields are marked *