VMs: RE: provable?
Title: RE: whopping great chains (=63!!!)
I ran a simulation. I took the Wycliffe Bible (Middle English) and wrote a program to hop around randomly, picking up sections of text 3-5 characters long, spaces and letters included, and build a new text of about 1 megabyte out of the bits. Then I counted the words and unique words in the text. The result was 78075 total words, of which 92% occurred only once.
For anyone with a morbid interest, the text looks like this:
bleir tf swen tos cheeesenace ois poueod ofand
sarijnneignacpreysperkiee bitmaedenprylijfgracune e
Lorblamethersisdyngn frrofeeesse h
e an g anlle
ctheereuedoone hinathentist hi faquihannoonme obilalm strothier de howas anith
sem he h a reekyn gis bym eie tretenthe
t-fond e dymaiestoohimre
od tisnoten rawecume Man thl oueoosngeew seit owisdoe thi deees th ibate
wi herme of YdofIsraheelJerusaciou
it thoou therei e treeem d dat
taltharaelsyneso me manynuurtoldh tireun eiththe dak agat
tohaltheehee GoandbarJorird el e Tis nd m
youngerto fye
ttake hees oot I chips ot bi whie onigyn an yng e to meinondururnes
st pelemym tweissisofutyes beesis doe le Amf hee
d
ofsonrinceingekitteueneir fWheryMoisestabeofn myidscoopt is
thn fogided wisit heianfruyl plief hehefito tis dnd tdrynert he
whayen
en ppe shindmai s notte f thoumakd th ge deur brotpriuncof
yohewibiroodee andoldetho ds knofanoyte ingissseino thJerritm ie ter ofe d
f
it e anden sleche menuermar when waof toulsistleftes ofete oThperslacoband hlid
wthoumeny cosAaroto toolo schd ofe schgis in psilue
ng om k schthere is tha
to dindoone en nom ishe on pleGod see lech l muis imy mlockd
Sammaadhesthheueofwelideadrisoundeete aaxf hnus s
hondredden onbuttheis
iid thn with ancomaechschcubitparie trunde bf thesayu scpeisid
thistiohymsidlayne thGodmentiuerreeheLorthe e
As an additional note, the Wycliffe Bible (complete with spelling irregularities and all) has a total of 22066 distinct words, of which 10050 occur more than once.
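The simulation described above could be sketched roughly as follows in Python. This is not the original program, which was not posted; the function names (scramble, word_stats), the uniform random choice of chunk length and start position, the whitespace-based word splitting, and the stand-in source string are all assumptions made for illustration.

```python
import random
from collections import Counter


def scramble(source, target_len, min_len=3, max_len=5, seed=None):
    """Build a new text by concatenating random chunks of the source.

    Each chunk is min_len..max_len characters long (spaces included) and is
    taken from a uniformly random position. Hypothetical helper, not the
    original program.
    """
    if len(source) < max_len:
        raise ValueError("source is shorter than the largest chunk size")
    rng = random.Random(seed)
    pieces = []
    total = 0
    while total < target_len:
        n = rng.randint(min_len, max_len)              # chunk length, inclusive
        start = rng.randrange(len(source) - n + 1)     # chunk start position
        pieces.append(source[start:start + n])
        total += n
    return "".join(pieces)


def word_stats(text):
    """Return (total words, distinct words, words occurring exactly once)."""
    counts = Counter(text.split())                     # split on whitespace
    total = sum(counts.values())
    distinct = len(counts)
    once = sum(1 for c in counts.values() if c == 1)
    return total, distinct, once


if __name__ == "__main__":
    # Stand-in source; the real run would read the full Wycliffe text instead.
    sample = "in the bigynnyng god made of nouyt heuene and erthe " * 50
    generated = scramble(sample, target_len=20000, seed=1)
    total, distinct, once = word_stats(generated)
    print(total, distinct, once)
```

For comparison, the counts quoted above for the Wycliffe text itself give 22066 - 10050 = 12016 words that occur exactly once, or roughly 54% of its distinct words.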
I would wager that the ratio of unique words to total vocabulary is higher for inflected, affixal and agglutinative languages, and lower for uninflected, periphrastic languages. So English, Chinese and Indonesian would be low, whereas Manchu, Arabic, Sumerian, Cree, and Latin would be high.
Brian Tawney
-----Original Message-----
From: owner-vms-list@xxxxxxxxxxx [mailto:owner-vms-list@xxxxxxxxxxx] On Behalf Of Marke Fincher
Sent: Wednesday, September 29, 2004 7:49 AM
To: vms-list@xxxxxxxxxxx
Subject: VMs: provable?
Consider two documents. They are both created by piecing together 'chunks' selected from a small underlying text. In one case it is an English document and the chunks are English words taken from an English dictionary. In the other case the chunks are selected randomly with a window moving over some master page. In the first case the choice of selection is dictated by an intended meaning, and in the second it is random. But how do you tell the difference?
You can't prove the English one wasn't created with the window technique; you just have to allow different window sizes and accept an improbable (but possible) set of choices for window positions.
You similarly cannot prove the random one has no meaning....
Marke
P.S. The bug I found was introduced for the last experiment only. Apologies again.
P.P.S. I think it is vital to bear in mind at all times that over 6000 of the 8700 VMs words occur only once in the whole manuscript.