
Re: VMs: Overfitting the Data (WAS: Another method different from Cardano Grilles)

To: vms-list@xxxxxxxxxxx
Subject: Re: VMs: Overfitting the Data (WAS: Another method different from Cardano Grilles)
From: Jacques Guy <jguy@xxxxxxxxxxxxxxxx>
Date: Sat, 29 Jan 2005 07:27:24 +1000
Reply-to: vms-list@xxxxxxxxxxx
Sender: owner-vms-list@xxxxxxxxxxx

28/01/2005 12:38:22 AM, Dennis <tsalagi@xxxxxxxx> wrote:

>Would you define "overfitting the 
>data" more fully, for a half-baked (unseasoned) 
>statistician?  :-)

I won't define it. I'll give an example.

But first, decide how you want to make a fortune:
the stock exchange, the horses, the roulette wheel?

I'll choose for you... the horses.

Now (I am taking this out of a rating system in an
old, old book).

Use past results for, say, 1000 races. For each race
now:

1. Look at the finish position of the horses at
   their last start.

2. Bet on the one with the lowest-numbered (best) finish position.

3. Look at the results of the race, see if the horse 
   won, and how much.

Hmmm... you get far more winners than if you had 
just picked your horses at random, but you are still
losing heaps of money.

So you scratch your head and come up with a brilliant
idea: take the horses' finish at their last start AND
add to that the starting price. Example: Running Snail
finished 3rd, at 4/1. Score: 3+4 = 7. Your pick for
each race is the horse with the lowest figure. (There is
logic in this way of rating horses. Think).

You'll probably get almost as many winners, but you will
be losing slightly less money.

But you are still losing money. So you put on your
thinking cap again... Weight! A high weight means
that the handicapper thinks the horse is better than
those with lower weight. Eureka! High weight is good.
So you add the last finish to the starting price, 
and you SUBTRACT that from the weight, e.g.
finish 3rd at 4/1 carrying 61kg: 61 - (3+4) = 54.
And you bet on the horse with the highest figure.
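
The two ratings so far can be sketched in a few lines of Python. This
is only an illustration: Running Snail's figures are the ones from the
example, but the other two horses, and the names score_v1 and score_v2,
are invented here.

```python
# Hypothetical field for one race. Only Running Snail's figures come from
# the example in the text; the other horses are made up for illustration.
horses = [
    {"name": "Running Snail", "finish": 3, "price": 4, "weight": 61},
    {"name": "Slow Coach", "finish": 1, "price": 8, "weight": 55},
    {"name": "Dead Cert", "finish": 2, "price": 2, "weight": 57},
]


def score_v1(horse):
    """Last finish position plus starting price (4/1 counts as 4)."""
    return horse["finish"] + horse["price"]


def score_v2(horse):
    """Weight carried (kg) minus the first rating."""
    return horse["weight"] - score_v1(horse)


# First system: bet on the LOWEST figure.
pick_v1 = min(horses, key=score_v1)["name"]
# Second system: bet on the HIGHEST figure.
pick_v2 = max(horses, key=score_v2)["name"]

for horse in horses:
    print(horse["name"], score_v1(horse), score_v2(horse))
print("finish + price picks:", pick_v1)
print("weight - (finish + price) picks:", pick_v2)
```

Note that the two formulas need not agree on the pick, which is exactly
what tempts you to keep tinkering.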

You go again through your data of 1000 races,
apply your formula, and... 

If it gets you a worse return, you go back to
finish + price and you put on your thinking cap 
again. If it gets you a better return (but you
still lose money), well, yes, you have to put
on your thinking cap again, to find some other
measure to _refine_ your formula.

Eventually I guarantee you that you will find a
formula that gets you a profit. It will look like
this:  subtract finish and price from weight carried
at the four last starts, giving you four figures;
apply these multipliers to those figures: 
4, 3, 2 , 1  for races run on Saturdays at Canterbury
on a fast track, but 6, 4, 3, 2 on a heavy track, 
5, 3, 3, 2 for races run on Wednesdays at Rosehill unless 
the race is a maiden in which case use 5, 2, 2, 1, 
except when... and so on and so on.

In fact, if you pile up enough rules, I am sure that
you will end up with the perfect system, giving you
the winner at every race. It will perfectly "predict"
the winners in your sample of 1000 races. 

Armed with your formula (perfect or just profitable), 
you now bet real money on the next 100 races. You will 
probably lose heavily (if you bet on the next 1000 races
you will certainly lose heavily), and much more heavily
if you used a "perfect" formula. Why? Your
formula is good only for the 1000 races from which
it has been extracted. 

But you do not despair, and you say: "I'll just have 
to refine my formula in the light of those new 100 
(or 1000) new races." And you will still lose because
you are only overfitting your data further!
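
Here is a rough simulation of the trap, as a sketch only. It makes
assumptions that are not in the story above: the winner of each race is
pure chance among 8 horses, each race has five made-up attributes (think
day, course, track condition, race type, distance), and the "system" is
simply "for this combination of attributes, back the horse number that
won most often in the past data". It counts winners rather than profit.
All names and numbers are invented for the illustration.

```python
import random
from collections import Counter, defaultdict

N_HORSES = 8
# Number of possible values for each made-up race attribute.
FEATURE_SIZES = [7, 10, 4, 5, 6]


def make_races(n, rng):
    """Generate n races as (attributes, winner); the winner is pure chance."""
    races = []
    for _ in range(n):
        features = tuple(rng.randrange(size) for size in FEATURE_SIZES)
        winner = rng.randrange(N_HORSES)
        races.append((features, winner))
    return races


def fit_rules(races, k):
    """One rule per combination of the first k attributes:
    back the horse number that won most often in the sample."""
    votes = defaultdict(Counter)
    for features, winner in races:
        votes[features[:k]][winner] += 1
    return {key: counts.most_common(1)[0][0] for key, counts in votes.items()}


def hit_rate(rules, races, k):
    """Fraction of winners picked; combinations never seen default to horse 0."""
    hits = sum(rules.get(features[:k], 0) == winner for features, winner in races)
    return hits / len(races)


rng = random.Random(2005)
past = make_races(1000, rng)    # the races the formula is built from
future = make_races(1000, rng)  # the races you bet real money on

results = []
for k in range(len(FEATURE_SIZES) + 1):
    rules = fit_rules(past, k)
    results.append((k, len(rules), hit_rate(rules, past, k), hit_rate(rules, future, k)))

print("attributes  rules  past-hit-rate  future-hit-rate")
for k, n_rules, in_sample, out_sample in results:
    print(f"{k:10d}  {n_rules:5d}  {in_sample:13.3f}  {out_sample:15.3f}")
```

The more attributes the rules use, the more rules there are and the
better they "predict" the past sample, while the hit rate on the new
races stays at about what chance gives.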

>> Voilà! Overfitting the data. But not good enough yet, by far. 
>> For how many words which DO NOT occur will those five wheels 
>> reconstruct? 

>	Is that, then, the test for whether the data are overfitted?

No. It is a measure of how many tons of garbage your wheel
has to produce for each grain of truth.
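
Counting the garbage is straightforward. A minimal sketch, assuming
each wheel is just a list of word fragments and a generated word is one
fragment from each wheel joined together. The three small wheels and
the list of attested words below are invented for the illustration;
they are not taken from the manuscript or from anyone's actual wheels.

```python
from itertools import product

# Hypothetical wheels: one fragment is taken from each, in order.
wheels = [
    ["qo", "o", "ch"],
    ["k", "t"],
    ["y", "dy", "al"],
]

# Hypothetical list of words that actually occur in the text.
attested = {"qoky", "qokal", "oky", "otal", "chdy"}

# Every word the wheels can produce.
generated = {"".join(parts) for parts in product(*wheels)}

truth = generated & attested    # produced and attested
garbage = generated - attested  # produced but never attested
missed = attested - generated   # attested but out of the wheels' reach

print("generated:", len(generated))
print("grains of truth:", len(truth))
print("garbage:", len(garbage))
print("garbage per grain of truth:", len(garbage) / len(truth))
print("attested words missed:", sorted(missed))
```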

A system of wheels (or Cardan grilles) designed to sift out 
this garbage might be a case of overfitting the data, depending
on its complexity.

So we have to consider on one hand how good is the "fit"
of the text produced by wheels or grilles (and I have given
a measure, based on chi-squared), and on the other hand
we must take into account the complexity of the wheels-or-grilles
system (which I do not know how to measure). If the wheels
are too complex they are overfitting the data.

Actually, I think that it would be easy to prove that...
hmmm... this gives me an idea. I'll have to think about it.
