I got a piece of C code from Marti
Hearst at Berkeley. Here is a snippet I cut from the description. I'm really
curious what it will do with the VMS, but I won't have time to play around with
it for some weeks. Does anyone want to try it?
----
tile - split a text file up into small related pieces called tiles

tile [-oiv] [-b bound] [-n numiter] [-k kval] [-w wnum] [-p nopara] file ...

The tile command is used to partition a document into a set of related pieces
called tiles. The main purpose of this is to allow a finer granularity of
indexing. Software that usually indexes documents or pages by words can now
use this software to break the document up into smaller elements of related
information (tiles) and index those instead. The algorithm that determines the
tiles in a document is essentially statistical in nature and is described in a
paper by Marti Hearst entitled "Multi-Paragraph Segmentation of Expository
Discourse", available from the U.C. Berkeley Computer Science Division. A copy
of that paper has been included with the source for the tile program.
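
For anyone who wants a feel for the idea before building the C source, here is a rough Python sketch of the kind of statistical comparison the paper describes: cut the text into fixed-size token sequences, compare the word counts on either side of each gap with cosine similarity, and treat dips in similarity as candidate tile boundaries. This is not the tile program's code. The function names and the parameters `w` and `k` are made up for illustration and are not claimed to match the -w and -k flags above, and picking plain local minima is a simplification of the paper's boundary selection.

```python
import math
import re
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def gap_scores(text, w=20, k=2):
    """Similarity at each gap between consecutive token sequences.

    w: tokens per sequence (hypothetical name)
    k: sequences per comparison block on each side (hypothetical name)
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    seqs = [Counter(tokens[i:i + w]) for i in range(0, len(tokens), w)]
    scores = []
    for gap in range(1, len(seqs)):
        # Merge up to k sequences on each side of the gap into one block.
        left = sum(seqs[max(0, gap - k):gap], Counter())
        right = sum(seqs[gap:gap + k], Counter())
        scores.append(cosine(left, right))
    return scores


def candidate_boundaries(scores):
    """Indices of strict local minima in the similarity curve (simplified)."""
    return [i for i in range(1, len(scores) - 1)
            if scores[i] < scores[i - 1] and scores[i] < scores[i + 1]]
```

Calling `candidate_boundaries(gap_scores(text))` gives indices into the score list; index `i` is the gap after token sequence `i + 1`, so a boundary there would fall at token offset `(i + 1) * w`.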