Syntax is fairly clearly understood by now. Grammars, parsers, and that stuff. It helps a lot if the syntax isn't entangled with semantics or type structure (for example, in C these things are entangled: it's impossible to parse C code without knowing which names are names for types). Most languages place unnecessary restrictions on syntactic nesting, such as that statements can contain expressions, but not vice versa. Barring entanglement, a clean conceptual interface between syntax and the other planes is tree structures. Lisp gets this part right, though there's a lot of unhappiness with its interface to human beings.
So syntax: postpone, and start with the simplest thing that could possibly work, which is f(a, b, c) or (f a b c). We'll grow some conventions with operator priorities and left/right/no/chain associativity and use them for things like ':'.
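To make the starting point concrete, here's a minimal sketch in Python of the (f a b c) form -- a toy s-expression reader. All the names here are made up for illustration, not taken from any existing system.

```python
# Minimal s-expression reader: "the simplest thing that could possibly work".
# (f a b c) parses to the nested list ['f', 'a', 'b', 'c'].

def tokenize(text):
    """Split an s-expression string into parens and atoms."""
    return text.replace("(", " ( ").replace(")", " ) ").split()

def parse(tokens):
    """Consume tokens, producing a nested Python list (the tree interface)."""
    tok = tokens.pop(0)
    if tok == "(":
        node = []
        while tokens[0] != ")":
            node.append(parse(tokens))
        tokens.pop(0)  # discard the closing ')'
        return node
    return tok

tree = parse(tokenize("(f a (g b) c)"))
```

The nested-list output is exactly the clean tree interface between syntax and the other planes; operator priorities and associativity can be layered on later without touching this core.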
Operations and types are less well understood. Here arises the issue of language level. There are several important distinctions here. There's storage management -- if and when there's garbage collection. And there's the topos used for composition -- this is what Haskell aficionados might call the monad, but I suspect the concept is more general.
Then there's the list of low-level operations made available, and the list of primitive data structures. In one sense, every usable language has to concern itself with these, because, ultimately, it has to be implemented using the mechanisms of real machines. They usually provide characters, bytes, words, addresses, and arithmetic and other operations on them. But some languages provide few of these to the programmer, preferring instead to provide things like arbitrarily sized integers. Such things can of course be provided by the programmer instead, if the low-level primitives are provided. The difference is one of convenience, reliability, and performance.
Storage management -- on a stack, on a heap.
Machine-oriented types -- the usual char, byte, int, float, pointer-to-X, and combinations of these. Dependent types, assertions defined by implementation and by formal logic.
Modularity. Restricted access as a form of security. Reflection as a way to violate this -- or does this have to be considered a lower-level implementation detail?
This kind of setup looks like a tool for parsing, type-checking, or code-generation.
It does *not* look particularly great for transformation, of the traditional suite-of-rules kind, which reworks the same (modified) piece of tree over and over again. Though it could be used to express individual rewrite rules, even complicated recursive ones.
Then, of course, there's the question of executing the generated post/prefix code, deciding any choices (with costs) that haven't yet been decided.
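The execution part, at least, is the easy bit. A toy Python sketch of running generated postfix code on a stack (the op table is hypothetical, just enough to show the shape):

```python
# Sketch: executing generated postfix code with a stack machine.
# Operands push themselves; operators pop their arguments and push the result.

OPS = {
    "+": lambda a, b: a + b,
    "*": lambda a, b: a * b,
}

def run_postfix(code):
    """Execute a postfix token list and return the final value."""
    stack = []
    for tok in code:
        if tok in OPS:
            b = stack.pop()
            a = stack.pop()
            stack.append(OPS[tok](a, b))
        else:
            stack.append(int(tok))  # anything not an operator is a literal
    return stack.pop()
```

Deferred choices with costs would show up here as special tokens that the executor resolves at this point rather than earlier; this sketch omits them.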
Then, of course, there's the question of notation for denoting the matching operations, not one at a time, but in a suite, structured. And I suspect this notation might even be type-checkable.
Is this kind of like the parsing and generating parts of rdp, generalised and reorganised? Eliminating the trafo component may be a huge simplification. Trafo, in practice, was never about organising a suite of independent rules; always about recursive translation.
The rdp parser would certainly be expressed in this form. But some of its aspects would be good to have in backtrack-immune form, such as identification of words. (Hey! That's preparse trafo?!)
So ... we have data structure traversal and generation (with generated conditionals and backpatching -- that's backtracking?) with provision for modifying the input (details are, of course, data-structure-dependent) and executing the (provisional) output.
Abstract notation should be data-structure-independent, but there should, of course, be structure-dependent operations/notations.
The convenient way to process program text is as a tree structure in memory. But there are alternatives. The traditional ones are prefix and postfix strings. They have advantages in their simplicity of access and memory management, provided processing is done in the proper order. Now some algorithms naturally process code in such orders, and if those algorithms are appropriate, that can be the best way to process them.
But it would be nice to express what is to be done in a notation that is independent of the representation, so that the choice of representation can be done when we already know what is to be accomplished.
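One way to picture that independence, as a Python sketch: write the construction once against an abstract builder interface, and let separate builders realise it as a tree or as a postfix string. The interface and names are invented for illustration.

```python
# One abstract "builder" interface, two representations.
# The algorithm (constructing f(a, g(b))) is written once;
# choosing tree vs. postfix is a separate, later decision.

class TreeBuilder:
    def atom(self, name): return name
    def apply(self, op, args): return [op] + list(args)

class PostfixBuilder:
    def atom(self, name): return [name]
    def apply(self, op, args):
        out = []
        for a in args:
            out.extend(a)
        out.append(op)  # operator follows its operands
        return out

def build_example(b):
    """Representation-independent construction of f(a, g(b))."""
    return b.apply("f", [b.atom("a"), b.apply("g", [b.atom("b")])])
```

The same build_example runs unchanged against either builder; that's the sense in which the "what" is decided before the representation.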
Input may be free, in which case it has to be handled with suspicion: everything checked, the reader verifying that types are correct and that the input is syntactically well-formed.
Or input may be typed, in which case we can use types known from context. The input does not need to be type-checked; its type-correctness is a precondition.
Or input may contain type marks. This is a special case of free; the type marks are just part of the grammar that we have to check.
This is probably relevant for prefix and tree input; for postfix it's not likely to be a lot of use.
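The three disciplines, sketched in Python for the trivial case of reading an integer (all names made up for illustration):

```python
import re

def read_free(token):
    """Free input: validate everything before trusting it."""
    if not re.fullmatch(r"-?\d+", token):
        raise ValueError(f"not a well-formed integer: {token!r}")
    return int(token)

def read_typed(token):
    """Typed input: context guarantees an integer; correctness is a
    precondition, so no check is done here."""
    return int(token)

def read_marked(token):
    """Type-marked input: the mark is part of the grammar we check,
    then the payload is read as free input."""
    mark, _, rest = token.partition(":")
    if mark != "int":
        raise ValueError(f"unexpected type mark: {mark!r}")
    return read_free(rest)
```

The point of the distinction is where the checking cost lands: in the reader (free), nowhere (typed), or in the grammar (marked).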
Instead of using Gambit or Scheme or Racket, I wrote a sed script that did the conversion. It took a few hours, the script was debugged into existence. I conclude that sed, though effective in this instance, was the wrong tool for this kind of job.
First, sed is line-oriented; html isn't. But there was a fairly rigid use of newlines in the input, and this was just compatible enough with sed's line-at-a-time attitude that the situation wasn't completely hopeless. Sed did have a mechanism for requesting explicitly that a new line be imported into its processing buffer.
The logical level at which to function is after tokenisation, but before parsing. I want to recognise a number of idioms and replace them with proper HTML tags. And given that some of these idioms look similar to parts of others (and vice versa), some kind of prioritisation is needed. Not to mention some control structure to prevent some replacements from being made in the output of others.
Here's what I seem to want.
Something that takes a stream of tokens, which may be as simple as characters, but may be complicated things. Some kind of pattern-matching/parsing of this stream, and from time to time recognising structures that are replaced in-place. Traditional parsing can be formulated like this by replacing specific strings of tokens, in the right context, by single parse trees, which also count as tokens (nonterminals). But there's no reason to restrict ourselves to nonterminals. You can replace recognised strings of tokens by pretty arbitrary other strings of tokens. These may get reparsed, or not. And you want a mechanism for indicating you're done with everything you've seen up to a point, so it can be written out and stop cluttering up memory.
So: operations of recognising, replacing, and flushing.
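A toy Python sketch of the stream-rewriting idea -- recognise a string of tokens, replace it in place (the replacement gets rescanned), flush what's settled. The rules are made-up examples, not the actual html idioms.

```python
# Prioritised in-place token rewriting: earlier rules win, replacements
# are rescanned, and tokens that match nothing are flushed to the output.

RULES = [
    # (pattern, replacement)
    (["*", "*"], ["<b>"]),
    (["-", "-"], ["&ndash;"]),
]

def rewrite(tokens):
    """Rewrite a token list (mutated in place) and return the flushed output."""
    out = []
    i = 0
    while i < len(tokens):
        for pat, rep in RULES:
            if tokens[i:i + len(pat)] == pat:
                tokens[i:i + len(pat)] = rep  # replace in place; rescan from i
                break
        else:
            out.append(tokens[i])  # done with this token: flush it
            i += 1
    return out
```

The flush step is what keeps memory from filling with already-decided prefix; the control structure against rewriting rewritten output would be a refinement (e.g. marking replacement tokens as inert), omitted here.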
Of course, token-recognisers have parameters and results; phrases can be embedded in subroutines, which have parameters and results. Parameters, in- and out-, can be marks.
And it's also possible to type all this. Marks vs phrases. Phrases vs. different kinds of phrases. Or different kinds of marks. Etc.
First language to experiment with all this is probably Scheme.
But I've already done something like this in m3 -- was it called piggy? It didn't really have very dynamic types, though. There I had different representations for several different kinds of types (integer, string?, pointer, maybe some kind of structure?) on the stack (which was a list) and different typed instructions to manipulate them -- these instructions were also in a tree of lists. They were type-checked (no dependent types, though) when being built into the list tree. There was a lot of tedious code involving these operations -- mostly the same, but for different types. But not quite regular enough to be implemented cleanly by a template mechanism. It was meant to be a model for machine code, where different instructions would have to be used.
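Roughly what that build-time type-checking looks like, reconstructed as a Python sketch. The op table and names are invented, not taken from the actual piggy code; each op declares what it pops and pushes, and an ill-typed sequence fails while being built, not while running.

```python
# Instructions type-checked as they are assembled into a program:
# simulate the stack of *types*, not of values.

OPS = {
    # name: (pops, pushes) as tuples of type names
    "push_int": ((), ("int",)),
    "push_str": ((), ("str",)),
    "add":      (("int", "int"), ("int",)),
    "concat":   (("str", "str"), ("str",)),
}

def check(program):
    """Simulate the type stack for a list of op names; return the final
    stack of types, or raise TypeError on a mismatch."""
    stack = []
    for op in program:
        pops, pushes = OPS[op]
        if tuple(stack[len(stack) - len(pops):]) != pops:
            raise TypeError(f"{op}: expected {pops} on stack, got {stack}")
        del stack[len(stack) - len(pops):]
        stack.extend(pushes)
    return stack
```

The "tedious code, mostly the same but for different types" shows up here as the per-type rows of the op table -- exactly the part that resisted a clean template mechanism.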
There's also the stack machine written in C and assembler. C was an ugly language for this. What I'd probably need to do is bootstrap an implementation of this idea which is just enough to implement itself, including the garbage collector. Then play with it from there. By the way, the implementation I had did not bootstrap itself, but it was possible to garbage collect the garbage collector if it wasn't needed any more, perhaps by the run-time installation of another garbage collector. This was in fact tested by replacing the collector by another copy of itself.
We'd need a translator to some already implemented language to start with. Possibly some of this translator could be written in an interpreted subset of its language, so that less effort need be done for later bootstrapping. We might want an intermediate code.
For the M3 situation, we'd probably want to separate the abstract machine interface (just above the type-checking) from the actual input notation, which is ugly. Same for the C system, but that's uglier.
There's also the JIT into machine code I wrote, which lacks any kind of structure that a garbage collector could hook into. That's another thing to look at. Again, one challenge is to make a higher-level notation to use to generate the calls to the JIT engine.
And this does look something like my old rdp, but orthogonalised, and, therefore, completely different.
Actually, the conceptual structure of the C threaded code interpreter is probably right. It's a threaded code interpreter, first written in assembler, then a C driver was tacked on to help in debugging, then that driver got types to further help debugging, and a new input reader that reverses stuff. But C is an awkward language to do anything but the lowest-level stuff.
The Modula 3 one used a lot of linked lists for code, did types, and had a probably too complicated syntax on top to be really practical for the kind of metalinguistic stuff I have in mind.
Take the Modula stuff and isolate the Polish processor that actually takes instructions, salts them away, and executes them. Two levels -- the yes man that just takes care of the mechanics, and the checker that does the types. These both take input via procedure/method calls. Maybe another checker thing that concatenates strings of ops given pre- and postconditions. That's something I was realising to be useful even when I was writing the one I've got.
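The two-level split, sketched in Python. YesMan and Checker are names I'm making up here; the point is that both levels take input through the same kind of procedure call, with the checker verifying types before forwarding to the mechanics.

```python
# Two-level Polish processor: a "yes man" that only salts instructions
# away and executes them, and a type checker layered on top of it.

class YesMan:
    """Mechanics only: accept ops via calls, execute on demand."""
    def __init__(self):
        self.code = []
    def emit(self, fn):
        self.code.append(fn)
    def run(self, stack):
        for fn in self.code:
            fn(stack)
        return stack

class Checker:
    """Type layer: same emit-style interface, but verified before forwarding."""
    def __init__(self, lower):
        self.lower = lower
        self.types = []
    def emit(self, fn, pops, pushes):
        if self.types[len(self.types) - len(pops):] != list(pops):
            raise TypeError(f"need {pops}, have {self.types}")
        del self.types[len(self.types) - len(pops):]
        self.types.extend(pushes)
        self.lower.emit(fn)

checker = Checker(YesMan())
checker.emit(lambda s: s.append(2), [], ["int"])
checker.emit(lambda s: s.append(3), [], ["int"])
checker.emit(lambda s: s.append(s.pop() + s.pop()), ["int", "int"], ["int"])
result = checker.lower.run([])
```

The pre/postcondition concatenator would be a third layer speaking the same interface: given the declared pops and pushes of two op strings, it can check that they compose before emitting either.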
Then above that, some kind of bootstrapping syntax for a sublanguage that parses strings of characters, tokens, things, including looking into the things.
So the real start is not the existing Modula 3 code as it stands, but to refactor the existing Modula 3 stuff along the lines of the C threaded code interpreter. Except the one thing we really don't need from C is storing everything in an array. That's lower-level than we need. We do need the conceptual layering.
I spent a day trying to refactor that Modula 3 code, and found it surprisingly resistant. The trouble is, there's application code mixed into the interpreter/typechecker/parser, and there shouldn't be. The i/p/c should provide the mechanisms by which the application can define lower-level primitives into the interpreter; it shouldn't contain them. But to achieve this, I seem to have to expose a lot more of the interpreter to the application than I really like.
But I got a small amount done in this direction. It would be really nice to be able to test the code. Ten or fifteen years ago, I seem to have left the code in a form that compiled, but didn't run properly -- presumably the state it was in when I got interested in something else or was forced to divert my efforts to so-called Real Life. And not having a viable revision management system back then, I didn't have a change log or a last working version. Or even a clearly identifiable working test case. I presumably didn't worry about this, because the code was only of interest to me, and no one was relying on me. I hope my practices on personal programming have improved since then.
It looks as if the application written in the scripting language was the main test rig, and that it was in the process of being changed when I stopped working on the project. What a mess! The main lesson? Don't have done that. In fact, it even looks as if I was playing around with variations in the language the test cases were written in, apparently to decide on grammar changes. A legitimate thing to do, but not a good state to leave things in.