I got a piece of C code from Marti Hearst at Berkeley. Here is a snippet I cut from the description. I'm really curious what it will do with the VMS, but I won't have time to play around with it for some weeks. Does anyone want to try?
tile - split a text file up into small related pieces called tiles
tile [-oiv] [-b bound] [-n numiter] [-k kval] [-w wnum] [-p nopara] file ...
The tile command is used to partition a document into a set of
related pieces called tiles. The main purpose of this is to allow
a finer granularity of indexing. Software that usually indexes documents
or pages by words can now use this program to break the document up into
smaller elements of related information (tiles) and index those instead.
The algorithm that determines the tiles in a document is essentially
statistical in nature and is described in a paper by Marti Hearst
entitled "Multi-Paragraph Segmentation of Expository Discourse", available
from the U.C. Berkeley Computer Science Division. A copy of that paper
has been included with the source for the tile program.
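
For anyone who wants a feel for the idea before digging into the C source, here is a minimal Python sketch of one way a statistical segmenter can work: count the words in adjacent fixed-size blocks, compare neighbouring blocks with cosine similarity, and cut wherever the similarity dips. This is only my illustration, not Hearst's algorithm and not the code in the tile program. The function names, block_size and threshold are all made up here, and I am not assuming they correspond to the -k, -w or -b options.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two word-count Counters (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


def gap_scores(tokens, block_size):
    """Score each gap between consecutive blocks by the similarity of its two neighbours.

    block_size is a hypothetical parameter: the number of tokens per block.
    """
    blocks = [Counter(tokens[i:i + block_size])
              for i in range(0, len(tokens), block_size)]
    return [cosine(blocks[i], blocks[i + 1]) for i in range(len(blocks) - 1)]


def boundaries(scores, threshold):
    """Return gap indices that are local minima (ties included) below the threshold.

    threshold is a hypothetical cut-off; missing neighbours at the ends count as 1.0.
    """
    cuts = []
    for i, s in enumerate(scores):
        left = scores[i - 1] if i > 0 else 1.0
        right = scores[i + 1] if i + 1 < len(scores) else 1.0
        if s < threshold and s <= left and s <= right:
            cuts.append(i)
    return cuts
```

To try it, run tokenize() on a text, pass the tokens to gap_scores() with some block size, and hand the scores to boundaries(). A returned index i means a cut between block i and block i+1. Whether fixed blocks and a fixed threshold are sensible for a given text is something you would have to find out by experiment.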