Just enough language to define new words

This is a bit of a milestone for me. I have got to the point where I now have enough of the ELIUS language to enable it to begin to define itself. This may seem an odd thing to want to do, so I’ll try and explain in a bit more detail what I am aiming for, why I took made some of the choices I did, and how it currently works.

Why do it?

Writing a programming language in the way that I am doing it is arguably a more difficult route. Programming in an existing high-level language with plenty of library support is always going to be easier than “bootstrapping”, so why do it?

For me there are at least three reasons. The first reason is portability. Any part of the language which is written in another language requires that other language to build and/or run it. That means that the new language will only be usable in environments where the development language can be used. I’m hoping that once I get the basics in place I will be able to write and run programs on a wide variety of platforms, from big, fast PCs to small, cheap, or specialist microcontrollers.

The second reason is for development. Although I love my Raspberry Pi, I don’t always have it available to develop code on. For that I use whatever Linux, Windows or Mac machine I happen to have to hand. I want to be able to develop and test code written in my new language on all these machines.

And the third reason is, well, because the challenge is fun. I have created languages of various sorts built on top of existing high-level languages many times. This one I want to be as “close to the metal” as I can make it.

Design choices

For all the above reasons, it makes sense to keep the bit which absolutely has to be written in an existing language (the “germ”, if you will) as small as possible. This is the bit which will need to be “ported” to every different CPU architecture or platform. Particularly for target systems with relatively small amounts of memory and/or slow CPUs, I want to be able to re-code this part of the language in assembly-language, even though or the current development version it is written in C. The more of it there is, the harder and more error-prone this will be.

To keep the germ of the language small I have made the decision to code everything that I can sensibly implement in the new language, even if it is arguably easier in the development language. This forces me to be aware of exactly how much of the germ code is really needed and to be bold about trimming it wherever I can.

Another major design choice is to start with a single-process version of the system. Although I have many ideas about how I might implement multi-processing, I don’t want it to complicate things right now. There are plenty of possible applications which don’t need it, and I want to get to the point where I can make some of those as soon as possible.

Last in this section for today, I have decided to make the code as simple as I can, even when that means incorrect usage could cause problems. I don’t want the germ code to become riddled with checks for null pointers or illegal accesses. If that sort of thing is required by a particular application, it will need to be done in the lowest-levels of the application code rather than in the system.

How it currently works

When the system is running, data is stored in three main data structures: the data stack, the string pool, and the dictionary. There are also a few others (return stack, input buffer and scratchpad, for example) but they can be safely ignored for this introduction.

The busiest data structure is the data stack. This is used for transfer of information between bits of code. There are no functions or procedures with parameter lists – each bit of code gets what it needs from the stack, and it is the responsibility of the application (or of the user if this is an interactive session) to ensure that the right values have been placed on the stack in the right order before the word is called. The stack is allocated a block of memory, between system constants DS_START and DS_END. Each “slot” in the stack is the size of a “word” (which can hold an integer or a memory address). The fundamental operations are:

void push(word) adds a word to the “top” of the stack, and moves the system variable DS_TOP on on wword.
word pop() does the reverse – move DS_TOP back one word and return the word which was at the top.

The string pool, on the other hand, is mostly static once a program has been loaded. It lives in a chunk of memory between system constants POOL_START and POOL_END and consists of a linked chain of records. Each record has a length (indicating the number of characters or bytes in the entry), the address of the next entry, and the bytes representing the characters themselves. This means that the pool records are variable in length. For memory access reasons, the start of each pool entry is aligned on a word boundary, so the address of the next entry will not always be the same as the address after the end of the characters. This is why the record has both a length and a next pointer. The fundamental operations are:

address padd(address source, word length) adds a word to the next free entry in the pool. The supplied length is copied into the length field, the bytes are copied from the source address to the end of the string pool, and the next pointer is set with the rounded-up address of the next record.
address plup() looks for a string in the pool, returning the address of its pool record if found, or the special token 0xFFFFFFFF if not found.

There is also an extra operation, built from the two above. address pens(address source, word length) “ensures” that a string is present in the pool, returning an existing record address if the string is already present, or adding it with padd if it is not found.

Now on to the data structure I am most concerned with at the moment, the dictionary. The dictionary is used to look up and execute words, either “primitives” which are coded directly on the underlying machine, or “definitions”. Unlike the string pool which is searched from start to finish, the dictionary is searched backwards, allowing new words to override old ones if required. Each dictionary record consists of four fields: the name (the address of the string pool record representing the name of the word, the type (the address of some code to call when the word is executed), the parameter (data to pass to the type code so that the same code can be used for different purposes) and the address of the previous entry. The fundamental operations are:

address dlup(address symbol) given the address of a pool record try to find a word with that name and return the address of its dictionary record or 0xFFFFFFFF if not found.
void execute(typefn fn, word param) executes a type function, passing it the supplied parameter..
void evaluate(address p, word length) parse and execute a sequence of characters as follows:
1. any characters between [ and ] are gathered and passed to pens and the resulting address is pushed onto the stack
2. any non-white-space characters are gathered and passed to plup. If found, the type and parameter fields are passed to execute (which may result in pushing or popping values on the stack etc.). If the word is not found in the dictionary and it is a valid number, that number is pushed on the stack.
3. when the supplied character sequence is empty, evaluation ends,leaving what remains on the stack.

How words get into the dictionary is another question. For a definition, the process involves finding the first free slot in the dictionary. putting the address of the name in the name field, putting the address of the code to execute in the param field, putting the address of an evaluate function in the type field, and putting the address of the previous record in the prev field. To keep everything straight for the next word to be added or looked up, also move the DICT_HEAD and DICT_NEXT system variables to include the new definition.

So eventually, here are the current minimum set of words in ELIUS:

name	type	param	notes
@	primitive	prim_w_read	read the word at an address on the stack, and push the read value
!	primitive	prim_w_write	store a value on the stack to an address on the stack
W+	primitive	prim_w_plus	add the size of a (32-bit)word to an address on the stack
DICT_HEAD	literal	DICT_HEAD	push the address of DICT_HEAD on the stack
DICT_NEXT	literal	DICT_NEXT	push the address of DICT_NEXT on the stack
DEF_FN	literal	defined	push the address of the evaluation function on the stack
DENT_NAME	dict_offset	DENT_NAME	add the offset of the DENT_NAME field to the value of DICT_NEXT and push the result
DENT_TYPE	dict_offset	DENT_TYPE	add the offset of the DENT_TYPE field to the value of DICT_NEXT and push the result
DENT_PARAM	dict_offset	DENT_PARAM	add the offset of the DENT_PARAM field to the value of DICT_NEXT and push the result
DENT_PREV	dict_offset	DENT_PREV	add the offset of the DENT_PREV field to the value of DICT_NEXT and push the result
dent_set	defined	`DEF_FN DENT_TYPE ! DENT_NAME ! DENT_PARAM !`	set the values of the fields in the next dictionary record
dent+	defined	`W+ W+ W+ W+`	add the size of a dictionary record to an address
dent_next	defined	`DICT_NEXT @ DICT_HEAD ! DICT_NEXT @ dent+ DICT_NEXT !`	update the DICT_HEAD and DICT_NEXT addresses
dent_blank	defined	`DICT_HEAD @ DENT_PREV ! 0 DENT_NAME ! 0 DENT_TYPE ! 0 DENT_PARAM !`	clear out the next free dictionary record
def	defined	`dent_set dent_next dent_blank`	finally, define a new word based on a name and body on the stack

With all this, we can now do something like [123 W+] [ugh] def which uses the words above, defined in ELIUS, to manipulate the dictionary data and add a new word. This new word can now be called just like any of the built-in words, by entering its name ugh. In this case it will end up with 139 (i.e 123 + (4 * 4 bytes) on the stack.

I hope that all made some sort of sense. For me it is a big step forward. Now I need to use these tools to build a first “interesting” program, and along the way to see what other core words might need to be defined.

Raspberry Alpha Omega

Raspberry Pi from start to finish

Just enough language to define new words

Why do it?

Design choices

How it currently works