Text-generation

Algorithms, data, and code for generating text.

I'm doing nanogenmo in June this year, becauase I'm otherwise employed during November. But whether I'll actually accomplish something, or just explore the intellectual space remains to be seen. This page is a collection of thoughts, ranging from the trivial to the substantial. It's a work-in-progress, and will probably remain one until it is abandoned.

This file is where I keep track of what I've done, plan to do, or even vaguely wonder about doing. The project is hosted at https://github.com/hendrikboom3/Text-generation .

I've gathered links to resources I found on the various nanogenmo repositories and placed them in resources.md.

What's working now.

I've ignored the various grandiose plans I've been hatching, and just went ahead and wrote count.ml. Upi will need an ocaml implementation to run it. Ocaml is free softwre, and is available for Mac, Windows, and Linux. For gnu make, the command is

 make unittest

There will possibly be variations on this trivial story before the month is up.

I found it surprisingly difficult to get the details of English number syntax right.

2016 06 04: I'm writing another one: this time I have the counting go transfinite through some constructible ordinals. The first version of this is in ord.ml. Unfortunately, the code has become a mess. I have to rewrite.

I did that, by building a new data structure to express the ordinals. These can easily be constructed in uniform loops. I no longer need to mess with the control flow to avoid printing expressions involving zeros and ones, because I use a generic simplification routine to reduce things like omega squared times zero plus seven to just seven, making it look idiomatic. That part is now working, but it is not well-intergrated with the narrative. I will next have to provide hooks to get the appropriate story narrative in the proper places. The trick is to introduce the patterns in the numbers when they first become relevant, and the to say the equivalent of "more of same".

It now counts up to $omega^{\omega 2+3} 3+\omega^{3} 3+\omega^{3}$. The Unix wc utility tells me it has 69988 words, some of which are surely HTML tags.

I now have a new appreciation for the size of these constructive ordinals. And this isn't even close to the league of big constructive ordinals.

The math on the page overloads my browser. iceweasel/firefoxx gave up long ago. Midori gave out when the wordcount approached 50000. Chrome will still display the entire page, but it takes it ages to do the whole page, and mathjax produces an error message after displaying everything correctly. Under the circumstances, if the novel isn't perfect, too bad.

Perhaps I should try again with straight LATEX output. Who knows? It might be practical.

2016 06 19

Indeed, straight LATEX output works a lot better. Once pdflatex is through with it, which is fast, the browser is completely happy with the pdf file.

Discussion on various ways to approach the problem

Set my ordinals and counting aside. They were fun, they produced words, but they lead nowhere. Though I still would like to get the flavour text into the ordinal program sometime. But probably not for ths month.

what seems to be needed.

Setting up such transitions might be able to drive something resembing a plot. Also need a planning engine to get the characters' actions fo relate to their motivations.

This seems a lot to set up in what remains of a month. Unless tools for this are already available and easy to learn the use of.

Goal-direction.

Continue brainstorming.

Implementation

It looks as if I'm going to want an object system to implement most of this stuff. The OCaml one, with row polymorphism (which they say others call duck typing, but the case-hardened duck typers say isn't duck typing at all because it's staticallt checkeed) and subtyping, seems attractive. I don't know if I'll be using OCaml classes at all. It would be nice ifOCaml's pattern match with subtypes may be the mechanism I need for checking preconditions to rules, but I suspect it won't work. I guess I'll have to fine out whether partial object types can be used to discover type information at run time that isn't statically available at compile time. IF that doesn't work, I'll have to interpret the patterns in their rules with my own hand-coded type tests and data structures.

Activities

This is probably more work than fits in a month already:

Components

text generator

I'll want a text generator that somewhat respects the rules of English grammar.

database

I'll want something like a triple store, that stores facts about the world. It should, ideally be possible to have triples in the store be objects that can be mentioned in other triples. There needs to be a way to write a triple store, read it in, with type checking, modify it if things happen in the story, and write it out again.

A data base? Maybe. It will probably have to be a heterogenous one, and to make it fast, it will probably work entirely in RAM, reading and writing to some human readable form that might be editable by humans.

What could be in the database?

Maybe subject-verb-object triples. Triples need to have IDs themselves, so they can be referened. It's probably important to be able to record which characters know which facts. THen the deductino engine should be able to reason based on what a specific character knows. Maybe this is a feature for later; it may be too much for an initial system. But something like this is necessary to give proper scope to a character's motivations. There will have to be some way of identifying which facts everybody knows; otherwise the number of stored A-knows-B facts will explode.

I call them triples, they are perhaps more properly called predications. Or even just facts. The verb would be the predicate, and they could be n-ary. I could go all combinatory-logic on this, but let's not.

Some typical facts:

There will probably have to be individuals, such as "Bob", and generic entities, such as "a man".

I've got to work this our carefully. Maybe that book on taxonomy will help. There has to be very limited deduction, and the facts will have to be organised to make this fast. I'm not looking for a full first-order theorem prover, even.

Revison management

The revision management system will be git and github. That's the one used for nanogenmo. .

Deduction

Some kind of logic engine on the database, to draw obvious conclusions. This wil probably grow on demand, as I need it, as it becomes more convenient to use general tools on data stored in the database than to code special cases. Havinf code in the data base woud be nice, too, but that would only work if the system is written in something like a List, rather than something like OCaml.

Some kind of planning system.

This could be a variant on the deduction system.

Context

I have no idea whether this could possibly fit with my as yet nonexistent project to randomly generate cities. That requires a geometry database.

Ordinal program.

Next project, to count ordinals. Not the trouble with ordinals is that given any mechanism for gennerating them, there's another bigger one that hasn't been covered yet. Not to mention that to get to $$\omega$$ there's in infinite number of them to enumerate first, so it's necessary to skip a lot of them. I think I'm going to have to just start and see what happens.