Language design tryout area

A programming language has three planes: syntax, operations, and types.

Ideally I'd like these to be independent, so that experimentation becomes easy.

Syntax is fairly clearly understood by now: grammars, parsers, and that stuff. It helps a lot if the syntax isn't entangled with semantics or type structure (for example, in C these things are entangled: it's impossible to parse C code without knowing which names are names for types). Most languages place unnecessary restrictions on syntactic nesting, such as that statements can contain expressions, but not vice versa. Barring entanglement, a clean conceptual interface between syntax and the other planes is tree structures. Lisp gets this part right, though there's a lot of unhappiness with its interface to human beings.

So syntax: postpone, and start with the simplest thing that could possibly work, which is f(a, b, c) or (f a b c). We'll grow some conventions with operator priorities and left/right/no/chain associativity and use them for things like ':'.
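To keep myself honest, here's a minimal sketch of that simplest-possible syntax in Python: whitespace-separated atoms and parenthesised application, parsed into plain nested lists. The function names are illustrative only, not a design commitment.

    # Parse "(f a b c)" notation into nested lists (trees).
    def tokenize(text):
        return text.replace("(", " ( ").replace(")", " ) ").split()

    def parse(tokens):
        tok = tokens.pop(0)
        if tok == "(":
            node = []
            while tokens[0] != ")":
                node.append(parse(tokens))
            tokens.pop(0)  # discard the closing ")"
            return node
        return tok  # an atom

    print(parse(tokenize("(f a (g b) c)")))  # ['f', 'a', ['g', 'b'], 'c']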

Operations and types are less well understood. Here arises the issue of language level. There are several important distinctions here. There's storage management -- if and when there's garbage collection. And there's the topos used for composition -- this is what Haskell aficionados might call the monad, but I suspect the concept is more general.

Then there's the list of low-level operations made available, and the list of primitive data structures. In one sense, every usable language has to concern itself with these, because, ultimately, it has to be implemented using the mechanisms of real machines. They usually provide characters, bytes, words, addresses, and arithmetic and other operations on them. But some languages provide few of these to the programmer, preferring instead to provide things like arbitrarily sized integers. Such things can of course be provided by the programmer instead, if the low-level primitives are provided. The difference is one of convenience, reliability, and performance.

Storage management -- on a stack, on a heap.

Machine-oriented types -- the usual char, byte, int, float, pointer-to-X, and combinations of these. Dependent types, assertions defined by implementation and by formal logic.

Modularity. Restricted access as a form of security. Reflection as a way to violate this -- or does this have to be considered a lower-level implementation detail?

Tree Processing Tool

The idea is to make a tool that operates on trees by parsing, generating, and executing notations encoded as trees. The trip.pol branch is for doing this with efficient storage use, with everything encoded as Polish strings, prefix or postfix.

Matching:

We need a set of matching operations. They check data structure (which could be an expression involving arithmetic and dereferencing; could be mvalues (such as in the lang-intel tool)), could have side effects (generating, say, prefix or postfix or data structures, accumulating costs), and do backtracking (mark and reset; these may even be such as to retain a tree of alternatives). This stuff may be garbage-collected or not.
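To make mark-and-reset concrete, a small Python sketch. Everything here is an assumption about the eventual design: the matcher is a cursor over tokens, side effects go into an output list, and reset restores both position and output.

    # Sketch: matching with mark/reset backtracking over a token list.
    class Matcher:
        def __init__(self, tokens):
            self.tokens, self.pos, self.out = tokens, 0, []

        def mark(self):
            return (self.pos, len(self.out))

        def reset(self, m):
            self.pos = m[0]
            del self.out[m[1]:]       # undo side effects made since the mark

        def lit(self, t):
            if self.pos < len(self.tokens) and self.tokens[self.pos] == t:
                self.pos += 1
                self.out.append(t)    # side effect: emit output (say, postfix)
                return True
            return False

    def alt(m, *branches):
        saved = m.mark()
        for b in branches:
            if b(m):
                return True
            m.reset(saved)            # backtrack: position and output restored
        return False

    m = Matcher(["a", "c"])
    ok = alt(m, lambda m: m.lit("a") and m.lit("b"),
                lambda m: m.lit("a") and m.lit("c"))
    print(ok, m.out)                  # True ['a', 'c']

Retaining a tree of alternatives, instead of discarding them on reset, would be a variation on the same interface.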

This kind of setup looks like a tool for parsing, type-checking, or code-generation.

It does *not* look particularly great for transformation of the traditional suite-of-rules kind, which reworks the same (modified) piece of tree over and over again. Though it could be used to express individual rewrite rules, even complicated recursive ones.

Then, of course, there's the question of executing the generated postfix/prefix code, deciding any choices (with costs) that haven't yet been decided.

Then, of course, there's the question of notation for denoting the matching operations, not one at a time, but in a suite, structured. And I suspect this notation might even be type-checkable.

Is this kind of like the parsing and generating parts of rdp, generalised and reorganised? Eliminating the trafo component may be a huge simplification. Trafo, in practice, was never about organising a suite of independent rules; it was always about recursive translation.

The rdp parser would certainly be expressed in this form. But some of its aspects would be good to have in backtrack-immune form (such as identification of words (hey! that's preparse trafo?!)).

So ... we have data structure traversal and generation (with generated conditionals and backpatching -- that's backtracking?) with provision for modifying the input (details are, of course, data-structure-dependent) and executing the (provisional) output.

The abstract notation should be data-structure-independent, but there should, of course, also be structure-dependent operations/notations.

---

Development of a few formalisms

  1. A sequence of symbols. No interpretation at all at this level.
  2. Operators and operands. It's assumed the reader knows how many operands each operator has.
  3. Operators and operands, explicit delimitation of nesting. Like f(...) or (f ...) conceptually.
  4. Operators and operands as in 2, with aggregators taking single sequence operands, sequences as in 1. (See the sketch after this list.)
4 looks nice.
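One possible reading of 4, sketched in Python. The fixed arity table and the explicit element count on the aggregator are my assumptions (an end marker would do as well).

    # Formalism 4: operators with known arity, plus an aggregator ("seq")
    # taking a single sequence operand whose symbols stay uninterpreted.
    ARITY = {"+": 2, "x": 0}

    def read(stream):
        sym = next(stream)
        if sym == "seq":                        # aggregator
            n = int(next(stream))               # element count (assumed)
            return ["seq", [next(stream) for _ in range(n)]]
        return [sym] + [read(stream) for _ in range(ARITY[sym])]

    print(read(iter("+ x seq 3 a b c".split())))
    # ['+', ['x'], ['seq', ['a', 'b', 'c']]]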

Implementing textual management

The convenient way to process program text is as a tree structure in memory. But there are alternatives. The traditional ones are prefix and postfix strings. They have advantages in their simplicity of access and memory management, provided processing is done in the proper order. Now some algorithms naturally process code in such orders, and if those algorithms are appropriate, that can be the best way to process them.

But it would be nice to express what is to be done in a notation that is independent of the representation, so that the choice of representation can be made when we already know what is to be accomplished.
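As a toy illustration of why that's attractive, here's the same evaluation written twice, once per representation; the operator table and names are illustrative. A representation-independent notation would let me state this once and derive both.

    # The same operation over two representations of the same code.
    OPS = {"+": lambda a, b: a + b, "*": lambda a, b: a * b}

    def eval_tree(node):                  # tree: natural recursive order
        if isinstance(node, int):
            return node
        op, l, r = node
        return OPS[op](eval_tree(l), eval_tree(r))

    def eval_postfix(symbols):            # postfix string: one left-to-right pass
        stack = []
        for s in symbols:
            if s in OPS:
                b, a = stack.pop(), stack.pop()
                stack.append(OPS[s](a, b))
            else:
                stack.append(int(s))
        return stack.pop()

    print(eval_tree(("+", 1, ("*", 2, 3))))    # 7
    print(eval_postfix("1 2 3 * +".split()))   # 7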

Types

Typechecking this stuff, based on grammars for the different kinds of data, will be a challenge; it will be the research component here.

Input may be free, in which case it has to be handled with suspicion: all things checked, the reader verifying that types are correct and that the input is syntactically well-formed.

Or input may be typed, in which case we can use types known from context. The input does not need to be type-checked; its type-correctness is a precondition.

Or input may contain type marks. This is a special case of free; the type marks are just part of the grammar that we have to check.

This is probably relevant for prefix and tree input; for postfix it's not likely to be of much use.
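A toy sketch of the three disciplines for integer tokens; the function names and error handling are mine, purely illustrative.

    def read_free(tok):
        # Free input: suspicion. Check well-formedness before accepting.
        if not tok.lstrip("-").isdigit():
            raise ValueError("not a well-formed integer: " + repr(tok))
        return int(tok)

    def read_typed(tok):
        # Typed input: type-correctness is a precondition; no check.
        return int(tok)

    def read_marked(tok):
        # Type-marked input: the mark is part of the grammar, and is checked.
        mark, _, body = tok.partition(":")
        if mark != "int":
            raise ValueError("expected an int mark, got " + repr(mark))
        return read_free(body)

    print(read_free("42"), read_typed("42"), read_marked("int:42"))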

2011 06 05

Just finished using sed to clean up some very ugly HTML -- the first chapter of one of the Shadow Chronicles. I couldn't just use an HTML parser and then make simple pattern-matching transformations. The trouble is that the input had the wrong structure. Titles and scene breaks were marked as paragraphs; paragraphs were marked as line breaks with a lot of non-breaking spaces to indent the first line. Except every now and then an actual HTML paragraph marker was used for a paragraph, just for elegant variation. Building this into a tree structure in a straightforward way (perhaps following HTML spec) would have given the wrong structure; whereas if all paragraphs were marked with a proper paragraph marker, the parser would have produced the right structure. And there were lots of other line breaks as the HTML explicitly forced layout and line width.

Instead of using Gambit or Scheme or Racket, I wrote a sed script that did the conversion. It took a few hours; the script was debugged into existence. I conclude that sed, though effective in this instance, was the wrong tool for this kind of job.

First, sed is line-oriented; HTML isn't. But there was a fairly rigid use of newlines in the input, and this was just compatible enough with sed's line-at-a-time attitude that the situation wasn't completely hopeless. Sed did have a mechanism for requesting explicitly that a new line be imported into its processing buffer.

The logical level at which to function is after tokenisation but before parsing: I want to recognise a number of idioms and replace them with proper HTML tags. And given that some of these idioms look similar to parts of others (and vice versa), some kind of prioritisation is needed. Not to mention some control structure to prevent some replacements from being made in the output of others.

Here's what I seem to want.

Something that takes a stream of tokens, which may be as simple as characters, but may be complicated things. Some kind of pattern-matching/parsing of this stream, from time to time recognising structures that are replaced in place. Traditional parsing can be formulated like this by replacing specific strings of tokens, in the right context, by single parse trees, which also count as tokens (nonterminals). But there's no reason to be restricted to nonterminals. You can replace recognised strings of tokens by pretty arbitrary other strings of tokens. These may get reparsed, or not. And you want a mechanism for indicating you're done with everything you've seen up to a point, so it can be written out and stop cluttering up memory.
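A rough Python sketch of that shape. The token-string rules and the fixed settling window (as the "done up to here" mechanism) are assumptions of mine; the sample rules echo the HTML cleanup above.

    # Rewrite runs of tokens in place; flush tokens that can't change any more.
    def rewrite(tokens, rules, window=4):
        buf, out = [], []
        for tok in tokens:
            buf.append(tok)
            changed = True
            while changed:                    # reapply rules until stable
                changed = False
                for pat, rep in rules:
                    for i in range(len(buf) - len(pat) + 1):
                        if buf[i:i + len(pat)] == list(pat):
                            buf[i:i + len(pat)] = list(rep)
                            changed = True
                            break
            while len(buf) > window:          # commit: written out, forgotten
                out.append(buf.pop(0))
        return out + buf

    rules = [(("br", "nbsp"), ("p",)),        # break + indent -> paragraph
             (("p", "p"), ("p",))]            # collapse doubled markers
    print(rewrite(["text", "br", "nbsp", "br", "nbsp", "text"], rules))
    # ['text', 'p', 'text']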

So: operations for recognising, replacing, marking, and committing.

Of course, token-recognisers have parameters and results; phrases can be embedded in subroutines, which have parameters and results. Parameters, in and out, can be marks.

And it's also possible to type all this. Marks vs phrases. Phrases vs. different kinds of phrases. Or different kinds of marks. Etc.

First language to experiment with all this is probably Scheme.

2012 04 19

Modula 3 has the garbage collection, has the primitive datatypes I need, and also is type-safe by itself. But it won't be great for direct-to-machine-code JIT compiling, because interfacing with the GC is going to be awkward. As far as I know, the M3 run-time system doesn't have any kind of distributed notion of type -- types appear to be centralised into one table. It may be awkward to add types at run time, since the language has no need for this.

But I've already done something like this in M3 -- was it called piggy? It didn't really have very dynamic types, though. There I had different representations for several different kinds of types (integer, string?, pointer, maybe some kind of structure?) on the stack (which was a list) and different typed instructions to manipulate them -- these instructions were also in a tree of lists. They were type-checked (no dependent types, though) when being built into the list tree. There was a lot of tedious code involving these operations -- mostly the same, but for different types. But not quite regular enough to be implemented cleanly by a template mechanism. It was meant to be a model for machine code, where different instructions would have to be used.

There's also the stack machine written in C and assembler. C was an ugly language for this. What I'd probably need to do is bootstrap an implementation of this idea which is just enough to implement itself, including the garbage collector. Then play with it from there. By the way, the implementation I had did not bootstrap itself, but it was possible to garbage collect the garbage collector if it wasn't needed any more, perhaps after the run-time installation of another garbage collector. This was in fact tested by replacing the collector with another copy of itself.

We'd need a translator to some already implemented language to start with. Possibly some of this translator could be written in an interpreted subset of its own language, so that less effort is needed for later bootstrapping. We might want an intermediate code.

For the M3 situation, we'd probably want to separate the abstract machine interface (just above the type-checking) from the actual input notation, which is ugly. Same for the C system, but that's uglier.

There's also the JIT into machine code I wrote, which lacks any kind of structure that a garbage collector could hook into. That's another thing to look at. Again, one challenge is to make a higher-level notation to use to generate the calls to the JIT engine.

And this does look something like my old rdp, but orthogonalised, and, therefore, completely different.

2012 05 28

It looks as if the next step in this project is to hack the stack machine written in Modula 3 so it contains text-processing primitives as described above. It already has a rudimentary type system. I'd want to add types that describe partially parsed text.

Actually, the conceptual structure of the C threaded code interpreter is probably right. It was first written in assembler; then a C driver was tacked on to help in debugging; then that driver got types to further help debugging, and a new input reader that reverses stuff. But C is an awkward language to do anything but the lowest-level stuff in.

The Modula 3 one used a lot of linked lists for code, did types, and had a probably too complicated syntax on top to be really practical for the kind of metalinguistic stuff I have in mind.

Take the Modula stuff and isolate the Polish processor that actually takes instructions, salts them away, and executes them. Two levels -- the yes man that just takes care of the mechanics, and the checker that does the types. These both take input via procedure/method calls. Maybe another checker thing that concatenates strings of ops given pre- and post-conditions. That's something I was realising to be useful even when I was writing the one I've got.
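A sketch of that last checker, with pre- and post-conditions modelled as stack effects; the op signatures are made up for the example.

    # Concatenate ops, checking that each one's pre-condition (what it
    # consumes from the stack) matches the accumulated post-condition.
    OPS = {                      # op -> (consumes, produces), top of stack last
        "lit_int": ([], ["int"]),
        "add":     (["int", "int"], ["int"]),
        "itoa":    (["int"], ["str"]),
    }

    def compose(ops, stack=()):
        stack = list(stack)
        for op in ops:
            pre, post = OPS[op]
            if pre and stack[-len(pre):] != pre:
                raise TypeError(op + " needs " + str(pre) + ", stack is " + str(stack))
            del stack[len(stack) - len(pre):]
            stack.extend(post)
        return stack             # the combined post-condition

    print(compose(["lit_int", "lit_int", "add", "itoa"]))  # ['str']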

Then above that, some kind of bootstrapping syntax for a sublanguage that parses strings of characters, tokens, things, including looking into the things.

So the real start is not the existing Modula 3 code, but to refactor the existing Modula 3 stuff along the lines of the C threaded code interpreter. Except the one thing we really don't need from C is storing everything in an array. That's lower-level than we need. We do need the conceptual layering.

2012 05 31

This seems to be becoming my tech blog -- at least for computer programming. Maybe just a tech diary, actually.

I spent a day trying to refactor that Modula 3 code, and found it surprisingly resistant. The trouble is, there's application code mixed into the interpreter/typechecker/parser, and there shouldn't be. The i/p/c should provide the mechanisms by which the application can define lower-level primitives into the interpreter; it shouldn't contain them. But to achieve this, I seem to have to expose a lot more of the interpreter to the application than I really like.

But I got a small amount done in this direction. Still, it would be really nice to be able to test the code. Ten or fifteen years ago, I seem to have left the code in a form that compiled, but didn't run properly -- presumably the state it was in when I got interested in something else or was forced to divert my efforts to so-called Real Life. And not having a viable revision management system back then, I didn't have a change log or a last working version. Or even a clearly identifiable working test case. I presumably didn't worry about this, because the code was only of interest to me, and no one was relying on me. I hope my practices on personal programming have improved since then.

It looks as if the application written in the scripting language was the main test rig, and that it was in the process of being changed when I stopped working on the project. What a mess! The main lesson? Don't have done that. In fact, it even looks as if I was playing around with variations in the language the test cases were written in, apparently to decide on grammar changes. A legitimate thing to do, but not a good state to leave things in.