RE: VMs: Overfitting the Data
Most of the time I don't think it is necessary to use sophisticated
mathematics to gauge overfitting. Given a generative system, just
examine what percentage of the words it generates are found in the
VMs. Some of the hypothetical wheel systems and their functional
equivalents will unavoidably generate 60,000 or more words!
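That percentage check can be sketched in a few lines of Python. This is a toy illustration only — the word lists and the `coverage_stats` helper are invented for the example, not a real EVA transcription or an established tool:

```python
# Toy sketch: gauge a generator by what share of its vocabulary is
# attested in the VMs, and by how much it overgenerates overall.
# The word lists below are invented examples, not real VMs data.

def coverage_stats(generated_words, vms_vocab):
    """Return (hit_rate, vocab_size) for a candidate generator:
    hit_rate   -- fraction of distinct generated words found in the VMs
    vocab_size -- number of distinct words the generator produced
    """
    gen_vocab = set(generated_words)
    hits = gen_vocab & set(vms_vocab)
    return len(hits) / len(gen_vocab), len(gen_vocab)

vms_vocab = {"daiin", "chedy", "shedy", "qokeedy"}
generated = ["daiin", "chedy", "xqzzy", "daiin"]
hit_rate, vocab_size = coverage_stats(generated, vms_vocab)
# A system whose hit_rate is high but whose vocab_size is 60,000+
# has bought its coverage with massive overgeneration.
```

The point of reporting both numbers together is exactly the one above: a high hit rate means little if the generated vocabulary is enormous.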
Of course, if the VMs text was generated by a process that involved
a random element, then even if that EXACT process were repeated you
would not get exactly the same set of words again. But given the
frequency distribution of VMs words that we see (which is far from
flat), you would expect the hard core of frequent words to reappear
in any subsequent rerun.
...and similarly, for any modern proposed generating system to be
considered successful, it should generate nearly all of the
frequent VMs words, _and in similar proportions_.
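One crude way to check that "similar proportions" requirement is to compare the relative frequencies of the most frequent VMs words against the frequencies a generator assigns those same words. A minimal sketch, on invented toy data — `proportion_mismatch` is a made-up helper, not an established VMs metric:

```python
from collections import Counter

def proportion_mismatch(vms_words, gen_words, top_n=3):
    """Half the L1 distance between the relative frequencies of the
    top_n most frequent VMs words and the frequencies the generator
    gives those same words (0 = identical proportions, 1 = disjoint)."""
    vms_freq = Counter(vms_words)
    gen_freq = Counter(gen_words)
    top = [w for w, _ in vms_freq.most_common(top_n)]
    v_tot = sum(vms_freq[w] for w in top) or 1
    g_tot = sum(gen_freq[w] for w in top) or 1
    return sum(abs(vms_freq[w] / v_tot - gen_freq[w] / g_tot)
               for w in top) / 2

# Invented toy corpora:
vms = ["daiin"] * 3 + ["chedy"] * 2 + ["shedy"]
same = proportion_mismatch(vms, vms)  # identical proportions -> 0.0
skew = proportion_mismatch(vms, ["daiin"] * 3 + ["chedy"] * 3)
```

A generator can then fail in two ways: by missing frequent words entirely, or by producing them at the wrong rates; the second failure is what a bare vocabulary-overlap count hides.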
At the moment wheels just don't do it for me, unless they are
highly controlled by some other system, from which the frequency,
word order and phrase patterns originate.
Marke
P.S. A related thought:
The much-debunked and forgotten superblock experiments were able to
generate 50-70% of the real VMs vocabulary, but within an overall
generated vocabulary of only 16,000 words.
But, for those who weren't there, it was also possible to create
an English superblock of about 8000 bytes which could recreate
a comparable proportion of the vocabulary of the KJV Bible!
______________________________________________________________________
To unsubscribe, send mail to majordomo@xxxxxxxxxxx with a body saying:
unsubscribe vms-list