VMs: RE: provable?

I ran a simulation. I took the Wycliffe Bible (Middle English) and wrote a program that hops around it at random, picking up sections of text 3-5 characters long (spaces and letters included), and builds a new text of about 1 megabyte out of the pieces. Then I counted the words and unique words in the new text. The result: 78075 total words, of which 92% occurred only once.

For anyone with a morbid interest, the text looks like this:

bleir tf swen tos cheeesenace ois poueod ofand sarijnneignacpreysperkiee bitmaedenprylijfgracune e Lorblamethersisdyngn frrofeeesse h
e an g anlle ctheereuedoone hinathentist hi faquihannoonme obilalm strothier de howas anith sem he h a reekyn gis bym eie tretenthe
t-fond e dymaiestoohimre od tisnoten rawecume Man thl oueoosngeew seit owisdoe thi deees th ibate wi herme of YdofIsraheelJerusaciou
it thoou therei e treeem d dat taltharaelsyneso me manynuurtoldh tireun eiththe dak agat tohaltheehee GoandbarJorird el e Tis nd m
youngerto fye ttake hees oot I chips ot bi whie onigyn an yng e to meinondururnes st pelemym tweissisofutyes beesis doe le Amf hee
d ofsonrinceingekitteueneir fWheryMoisestabeofn myidscoopt is thn fogided wisit heianfruyl plief hehefito tis dnd tdrynert he whayen
en ppe shindmai s notte f thoumakd th ge deur brotpriuncof yohewibiroodee andoldetho ds knofanoyte ingissseino thJerritm ie ter ofe d
f it e anden sleche menuermar when waof toulsistleftes ofete oThperslacoband hlid wthoumeny cosAaroto toolo schd ofe schgis in psilue
ng om k schthere is tha to dindoone en nom ishe on pleGod see lech l muis imy mlockd Sammaadhesthheueofwelideadrisoundeete aaxf hnus s
hondredden onbuttheis iid thn with ancomaechschcubitparie trunde bf thesayu scpeisid thistiohymsidlayne thGodmentiuerreeheLorthe e
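
The program itself is not reproduced here. A minimal Python sketch of the procedure as described might look like the following. The function names, the whitespace normalisation, the fixed random seed and the file name in the usage comment are all assumptions for illustration, not the original code. Since it is not stated whether the 92% is a share of running words or of distinct words, the sketch returns all three counts so either ratio can be computed.

```python
import random
from collections import Counter


def build_chunk_text(source, target_size=1_000_000, min_len=3, max_len=5, seed=None):
    """Concatenate randomly chosen slices of `source`, each min_len..max_len
    characters long, until the result is target_size characters long."""
    # Assumption: line breaks in the source are treated as single spaces.
    source = " ".join(source.split())
    if len(source) < max_len:
        raise ValueError("source text is shorter than the largest chunk size")
    rng = random.Random(seed)
    pieces = []
    size = 0
    while size < target_size:
        n = rng.randint(min_len, max_len)            # chunk length, inclusive bounds
        start = rng.randrange(len(source) - n + 1)   # random window position
        pieces.append(source[start:start + n])
        size += n
    return "".join(pieces)[:target_size]


def word_stats(text):
    """Return (running words, distinct words, words occurring exactly once)."""
    counts = Counter(text.split())
    total = sum(counts.values())
    distinct = len(counts)
    once_only = sum(1 for c in counts.values() if c == 1)
    return total, distinct, once_only


# Hypothetical usage ("wycliffe.txt" is a placeholder file name):
#   with open("wycliffe.txt", encoding="utf-8") as f:
#       fake = build_chunk_text(f.read(), seed=1)
#   print(word_stats(fake))
```

Chunks that happen to start or end on a space create word boundaries, and chunks that do not are fused with their neighbours, which is where the long run-together "words" in the sample come from.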

As an additional note, the Wycliffe Bible (spelling irregularities and all) has a total of 22066 distinct words, of which 10050 occur more than once.
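
For comparison with the other figures in this thread, the share of distinct words that occur only once follows directly from those two numbers. A small sketch of the arithmetic (the function name is just for illustration; the VMs line uses the "over 6000 of the 8700" figure from the quoted message below, so it is a lower bound):

```python
def once_only_share(distinct_words, repeated_words):
    """Return (count, share) of distinct words that occur exactly once."""
    once_only = distinct_words - repeated_words
    return once_only, once_only / distinct_words


once_only, share = once_only_share(22066, 10050)
print(once_only, f"{share:.1%}")   # 12016 54.5%

# VMs figure from the quoted message: "over 6000 of the 8700" words occur once.
vms_floor = 6000 / 8700
print(f"{vms_floor:.1%}")          # 69.0%
```

So roughly 54.5% of the distinct words in the real text occur once, against at least about 69% for the VMs.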

I would wager that the ratio of once-only (unique) words to total vocabulary is higher for inflected, affixal and agglutinative languages, and lower for uninflected, periphrastic languages. So English, Chinese and Indonesian would be low, whereas Manchu, Arabic, Sumerian, Cree and Latin would be high.

-----Original Message-----
From: owner-vms-list@xxxxxxxxxxx [mailto:owner-vms-list@xxxxxxxxxxx]On Behalf Of Marke Fincher
Sent: Wednesday, September 29, 2004 7:49 AM
To: vms-list@xxxxxxxxxxx
Subject: VMs: provable?

Consider two documents. They are both created by piecing together 'chunks' selected from a small underlying text. In one case it is an English document and the chunks are English words taken from an English dictionary. In the other case the chunks are selected randomly with a window moving over some master page. In the first case the choice of selection is dictated by an intended meaning, and in the second it is random. But how do you tell the difference? You can't prove the English one wasn't created with the window technique; you just have to allow different window sizes and accept an improbable (but possible) set of choices for window positions.

You similarly cannot prove the random one has no meaning....

P.S. The bug I found was introduced for the last experiment only. Apologies again.

P.P.S. I think it is vital to bear in mind at all times that over 6000 of the 8700 VMs words occur only once in the whole manuscript.