A few weeks ago I was experimenting with the clever little "bus pirate" board and its odd control language. A little later I was reading an article about some of the fun things Ward Cunningham has been getting up to with electronics, and I came across "txtzyme".
Thinking further on the similarities between the two systems led to the idea of delimiter-free languages, and idea which I think bears further thought.
What do I mean by "delimiter-free"? Both the commands processed by the bus pirate and the instructions which form txtzyme share a particular characteristic. Unlike mainstream programming languages (and even most textual data formats) they don't have any separators. No spaces, commas, semicolons, or anything else to "delimit" one word or value from another. Even languages with hardly any syntax usually have at least one delimiter. FORTH uses spaces, LISP uses spaces and brackets, CSV uses commas, plenty of languages use line breaks or semicolons, and so on.
The bus pirate and txtzyme languages require none of this. It is perfectly valid to just jam everything together; indeed this is often advantageous where speed is important, as no time is wasted parsing such unused characters.
Of course, not every language is suitable for this approach. Both the bus pirate and txtzyme languages consist mostly of single-character symbols. What interests me, though, is how far we might be able to take this delimiter-free approach into the territory of languages with multi-character symbols. Ideally I would like to see if I can achieve a completely "soft" language with no syntax of its own, yet which is capable of being programmed to emulate more traditional syntaxes when required.
At the moment the ELIUS parser does have delimiters - the parser recognizes the [
, ]
, and space
characters, and gives them special treatment. [
and ]
are used to "quote" text, and spaces are used to separate words. Simplistically, characters are accumulated into a buffer until the next delimiter is encountered, then the contents of the buffer are looked up in the dictionary. If the word is not found, it is an error.
A delimiter-free language might be able to be more clever than that. Imagine a dictionary containing two words abc and abe. Anything else, such as a
, bcdab
, or abd
is an error. There are no valid words which start with b
, so b...
must always be an error, however many extra characters we add. Unfortunately, the parser does not know whether the word is valid until it has encountered a delimiter, so it has to go though adding all those wasted characters.
My thoughts at the moment lean toward representing valid words as a tree of transitions. For example, in our very simple two word system, the transitions are:
- -> a
- a -> b
- ab -> c
- ab -> e
If we are at the start of a word and receive anything other than a
, it is an error. Once we have an a
, anything other than a b
is an error. Once we have ab
, only c
or e
are valid. The ab
-> c
and ab
-> e
transitions are somehow special, though. abc
and abe
represent complete valid words, whereas a
and ab
do not.
This is all well and good at tracking whether an sequence of input characters is valid, but it's not much use otherwise. We need a way to associate words with actions. Luckily the dictionary mechanism in ELIUS already does this. The trick in this case is that the executable code is not associated with a whole word, but with a particular node in the parse tree. This in turn becomes how we tell whether a sequence of characters represents a valid word - if the destination node in the tree has some action associated with it.
Parsing input becomes:
- reset the input state to "empty"
- read a character.
- if the character has a valid transition from the current state, advance the state and go back to step 2
- if the character does not have a valid transition, but the current state has an associated action, do it and go back to step 1
- if the character does not have a valid transition, and the current state does not has an associated action, error and go back to step 1
It's a bit wordy, and it hides a few important details (such as how to represent and traverse the tree), but this does seem a very powerful approach.