Delimiter-free languages – bus pirate and txtzyme

A few weeks ago I was experimenting with the clever little "bus pirate" board and its odd control language. A little later I was reading an article about some of the fun things Ward Cunningham has been getting up to with electronics, and I came across "txtzyme."

Thinking further on the similarities between the two systems led to the idea of delimiter-free languages, and idea which I think bears further thought.

What do I mean by "delimiter-free"? Both the commands processed by the bus pirate and the instructions which form txtzyme share a particular characteristic. Unlike mainstream programming languages (and even most textual data formats) they don't have any separators. No spaces, commas, semicolons, or anything else to "delimit" one word or value from another. Even languages with hardly any syntax usually have at least one delimiter. FORTH uses spaces, LISP uses spaces and brackets, CSV uses commas, plenty of languages use line breaks or semicolons, and so on.

The bus pirate and txtzyme languages require none of this. It is perfectly valid to just jam everything together; indeed this is often advantageous where speed is important, as no time is wasted parsing such unused characters.

Of course, not every language is suitable for this approach. Both the bus pirate and txtzyme languages consist mostly of single-character symbols. What interests me, though, is how far we might be able to take this delimiter-free approach into the territory of languages with multi-character symbols. Ideally I would like to see if I can achieve a completely "soft" language with no syntax of its own, yet which is capable of being programmed to emulate more traditional syntaxes when required.

At the moment the ELIUS parser does have delimiters - the parser recognizes the [, ], and (space) characters, and gives them special treatment. [ and ] are used to "quote" text, and spaces are used to separate words. Simplistically, characters are accumulated into a buffer until the next delimiter is encountered, then the contents of the buffer are looked up in the dictionary. If the word is not found, it is an error.

A delimiter-free language might be able to be more clever than that. Imagine a dictionary containing two words abc and abe. Anything else, such as "a", "bcdab", or "abd" is an error. There are no valid words which start with "b", so "b..." must always be an error, however many extra characters we add. Unfortunately, the parser does not know whether the word is valid until it has encountered a delimiter, so it has to go though adding all those wasted characters.

My thoughts at the moment lean toward representing valid words as a tree of transitions. For example, in our very simple two word system, the transitions are:

  • -> a
  • a -> b
  • ab -> c
  • ab -> e

If we are at the start of a word and receive anything other than 'a', it is an error. Once we have an a, anything other than a 'b' is an error. Once we have 'ab', only 'c' or 'e' are valid. The 'ab' -> 'c' and 'ab' -> 'e' transitions are somehow special, though. abc and abe represent complete valid words, whereas a and ab do not.

This is all well and good at tracking whether an sequence of input characters is valid, but it's not much use otherwise. We need a way to associate words with actions. Luckily the dictionary mechanism in ELIUS already does this. The trick in this case is that the executable code is not associated with a whole word, but with a particular node in the parse tree. This in turn becomes how we tell whether a sequence of characters represents a valid word - if the destination node in the tree has some action associated with it.

Parsing input becomes:

  1. reset the input state to "empty"
  2. read a character.
  3. if the character has a valid transition from the current state, advance the state and go back to step 2
  4. if the character does not have a valid transition, but the current state has an associated action, do it and go back to step 1
  5. if the character does not have a valid transition, and the current state does not has an associated action, error and go back to step 1

It's a bit wordy, and it hides a few important details (such as how to represent and traverse the tree), but this does seem a very powerful approach.

3 Comments

  1. Have to confess as a non-Forth programmer I didn’t follow all of the latter part of your blog-post, but the bit about “tree of transitions” kinda looks like the command tab-completions you get in the Bash shell ;-)

    I guess I’d describe BusPirate and txtzyme more as macro languages or command langauges rather than programming languages. I also wrote my own simple single-character command-language (running over an emulated USB serial connection) to interactively read/write the inputs and outputs on my MinimusUSB mentioned here (maybe I’ll convert it to txtzyme instead one day…)

    For a truly delimiter-free programming language look at http://www.muppetlabs.com/~breadbox/bf/ or for a delimiter-only language look at http://compsoc.dur.ac.uk/whitespace/ (written by one of my Uni coursemates).

    • I’d seen bf and whitespace before, but they feel more like abstract experiments rather than useful languages. Similarly I guess that all machine code and byte code languages have this delimiter-free property, but they are not really human-readable or directly composable into a more friendly (a.k.a domain-specific) language. Definitely need to muse on this a bit more.

  2. Frank,

    I also found out about Txtzyme – almost exactly at the same time as you. I blogged about it here

    http://sustburbia.blogspot.co.uk/2013/05/txtzyme-minimal-interpreter-and.html

    Then I went on to extend the “language” – so that you could write Forth-like colon definitions – I renamed it SIMPL – and I use it to exercise new hardware designs, ona variety of microcontrollers. It’s about as baremetal as you can get but useful for bringing up a new microcontroller and checking that all the basics are working.

    I’m now trying to extend it further as a rudimentary “quick & dirty” test language to run on FPGA soft cores.

    regards

    Ken

Leave a Reply

Your email address will not be published. Required fields are marked *