
VMs: RE: provable?

Title: RE: whopping great chains (=63!!!)
I ran a simulation.  I took the Wycliffe bible (Middle English) and wrote a program to hop around randomly picking up sections of text 3-5 characters long, spaces and letters included, and build a new text out of the bits that was about 1 megabyte in size.  Then I counted the words and unique words in the text.  The result was that there were 78075 total words, of which 92% occurred only once.
For anyone with a morbid interest, the text looks like this:
bleir  tf swen tos cheeesenace ois poueod ofand sarijnneignacpreysperkiee bitmaedenprylijfgracune e Lorblamethersisdyngn frrofeeesse h
e  an g  anlle ctheereuedoone hinathentist hi faquihannoonme obilalm strothier de howas anith sem he h a  reekyn gis bym eie tretenthe
t-fond e dymaiestoohimre od tisnoten rawecume  Man thl oueoosngeew seit owisdoe thi deees th ibate wi herme of YdofIsraheelJerusaciou
it thoou therei  e  treeem d dat taltharaelsyneso me manynuurtoldh tireun eiththe dak  agat tohaltheehee  GoandbarJorird el e  Tis nd m
youngerto fye ttake hees oot I chips  ot bi whie onigyn  an yng e to meinondururnes st pelemym tweissisofutyes beesis doe le Amf hee
d ofsonrinceingekitteueneir  fWheryMoisestabeofn myidscoopt  is thn  fogided wisit heianfruyl plief hehefito tis dnd tdrynert he whayen
en ppe shindmai s notte f thoumakd  th ge deur brotpriuncof yohewibiroodee andoldetho ds knofanoyte ingissseino thJerritm ie ter ofe d
f it e anden sleche menuermar when waof toulsistleftes ofete oThperslacoband hlid wthoumeny cosAaroto toolo schd ofe schgis in psilue
ng om k schthere is tha to dindoone en nom ishe on pleGod see lech l muis imy mlockd Sammaadhesthheueofwelideadrisoundeete  aaxf hnus s
hondredden onbuttheis iid thn  with  ancomaechschcubitparie trunde  bf thesayu scpeisid thistiohymsidlayne thGodmentiuerreeheLorthe e
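For anyone who wants to try something similar, here is a minimal Python sketch of the kind of program described above. It is not the original program: the function name `build_chunk_text`, its parameters, and the choice of uniformly random start positions and chunk lengths are all assumptions made for illustration.

```python
import random


def build_chunk_text(source, target_len, min_len=3, max_len=5, seed=None):
    """Build a new text by concatenating randomly chosen chunks of `source`.

    Hypothetical helper: each chunk is min_len..max_len characters long
    (spaces included) and starts at a uniformly random position.
    """
    if len(source) < max_len:
        raise ValueError("source text is shorter than the largest chunk")
    rng = random.Random(seed)  # seeded generator so runs can be repeated
    pieces = []
    total = 0
    while total < target_len:
        size = rng.randint(min_len, max_len)          # inclusive on both ends
        start = rng.randrange(len(source) - size + 1)  # chunk fits in source
        pieces.append(source[start:start + size])
        total += size
    # Trim the last chunk so the output has exactly target_len characters.
    return "".join(pieces)[:target_len]
```

Calling `build_chunk_text(source_text, 1_000_000)` on the full source would give a text of roughly the size mentioned above, assuming one byte per character.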
As an additional note, the Wycliffe bible (complete with spelling irregularities and all) has a total of 22066 distinct words, of which 10050 occur more than once.
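The counts quoted in this message can be reproduced with something like the sketch below. It assumes that a "word" is any run of characters between whitespace, which may not match the tokenization actually used; `word_stats` is a hypothetical name.

```python
from collections import Counter


def word_stats(text):
    """Return (total words, distinct words, words occurring exactly once).

    Assumption: words are whitespace-separated tokens, case-sensitive,
    with punctuation left attached.
    """
    counts = Counter(text.split())
    total = sum(counts.values())
    distinct = len(counts)
    once = sum(1 for n in counts.values() if n == 1)
    return total, distinct, once
```

Applied to the figures above, 22066 distinct words with 10050 occurring more than once leaves 22066 - 10050 = 12016 words that occur only once, or about 54% of the distinct vocabulary.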
I would wager that the ratio of words occurring only once to total vocabulary is higher for inflected, affixal and agglutinative languages, and lower for uninflected periphrastic languages.  So English, Chinese and Indonesian would be low, whereas Manchu, Arabic, Sumerian, Cree, and Latin would be high.
Brian Tawney
-----Original Message-----
From: owner-vms-list@xxxxxxxxxxx [mailto:owner-vms-list@xxxxxxxxxxx]On Behalf Of Marke Fincher
Sent: Wednesday, September 29, 2004 7:49 AM
To: vms-list@xxxxxxxxxxx
Subject: VMs: provable?

Consider two documents.   They are both created by piecing together 'chunks'
selected from a small underlying text.    In one case it is an English document
and the chunks are English words taken from an English dictionary.  In the other
case the chunks are selected randomly with a window moving over some 
master page.   In the first case the choice of selection is dictated by an intended
meaning, and in the second it is random.   But how do you tell the difference?
You can't prove the English one wasn't created with the window technique; you
just have to allow different window sizes and accept an improbable (but possible)
set of choices for window positions.
You similarly cannot prove the random one has no meaning....
P.S.   The bug I found was introduced for the last experiment only.  Apologies again.
P.P.S.   I think it is vital to bear in mind at all times that over 6000 of the 8700 VMs
words occur only once in the whole manuscript.