Quantcast

Problems with Statistics package

classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Problems with Statistics package

Hannes
Hi everyone,

I am using the octave statistics package right now, but have some  
unexpected (possibly bugged) results.

I have attached a log that shows the problem. On a perfect grouping,  
anova produces p-val of 1 instead of 0. The f-statistic being  
displayed as
-94353808874586048.0000
leads me to the conclusion that some internal number limits where  
exceeded, which is strange, because the example is very simple, the  
involved numbers very low etc.

Could someone help me with that?

Thank you very much!

Best,
Hannes

PS: System is:

octave:83> version
ans = 3.2.4
octave:84> uname
ans =
{
   sysname = Linux
   nodename = research-macpro01
   release = 3.2.0-24-generic
   version = #38-Ubuntu SMP Tue May 1 16:18:50 UTC 2012
   machine = x86_64
}

_______________________________________________
Help-octave mailing list
[hidden email]
https://mailman.cae.wisc.edu/listinfo/help-octave

octave_anova.log (2K) Download Attachment
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Problems with Statistics package

marco atzeri-2
On 5/30/2012 7:56 AM, Hannes wrote:

> Hi everyone,
>
> I am using the octave statistics package right now, but have some
> unexpected (possibly bugged) results.
>
> I have attached a log that shows the problem. On a perfect grouping,
> anova produces p-val of 1 instead of 0. The f-statistic being displayed as
> -94353808874586048.0000
> leads me to the conclusion that some internal number limits where
> exceeded, which is strange, because the example is very simple, the
> involved numbers very low etc.
>
> Could someone help me with that?
>
> Thank you very much!
>
> Best,
> Hannes
>
> PS: System is:
>
> octave:83> version
> ans = 3.2.4
> octave:84> uname
> ans =
> {
> sysname = Linux
> nodename = research-macpro01
> release = 3.2.0-24-generic
> version = #38-Ubuntu SMP Tue May 1 16:18:50 UTC 2012
> machine = x86_64
> }
>
>

There is an overflow:

octave:4> anova(samples,grps)
warning: division by zero

One-way ANOVA Table:

Source of Variation   Sum of Squares    df  Empirical Var
*********************************************************
Between Groups              944.4969     1       944.4969
Within Groups                 0.0000   159         0.0000
---------------------------------------------------------
Total                       944.4969   160

Test Statistic f                 Inf
p-value                       0.0000

ans = 0




octave:5> version
ans = 3.6.1
octave:6> uname
ans =

   scalar structure containing the fields:

     sysname = CYGWIN_NT-6.1-WOW64
     nodename = MARCOATZERI
     release = 1.7.16s(0.261/5/3)
     version = 20120525 17:01:08
     machine = i686


with statistics-1.1.0


Regards
Marco
_______________________________________________
Help-octave mailing list
[hidden email]
https://mailman.cae.wisc.edu/listinfo/help-octave
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Problems with Statistics package

Hannes

>
> There is an overflow:
>
> octave:4> anova(samples,grps)
> warning: division by zero

Note that I do not get a division by 0 warning on this example. But I  
do get quite a few division by zero warnings on other examples (always  
when the sample values are all identical). Because some of the results  
seemed right, I just disabled the warnings, but I don't feel I can  
trust the results at all right now.



> One-way ANOVA Table:
>
> Source of Variation   Sum of Squares    df  Empirical Var
> *********************************************************
> Between Groups              944.4969     1       944.4969
> Within Groups                 0.0000   159         0.0000
> ---------------------------------------------------------
> Total                       944.4969   160
>
> Test Statistic f                 Inf
> p-value                       0.0000
>
> ans = 0

At least you get the desired result (pval==0), while I get the opposite...

I have looked into the source code, which states the following:

   total_mean = mean (y(:));
   SSB = sum (group_count .* (group_mean - total_mean) .^ 2);
   SST = sumsq (reshape (y, n, 1) - total_mean);
   SSW = SST - SSB;
   df_b = k - 1;
   df_w = n - k;
   v_b = SSB / df_b;
   v_w = SSW / df_w;
   f = v_b / v_w;
   pval = 1 - fcdf (f, df_b, df_w);

Now the place where the problematic division takes place is
   f = v_b / v_w
which is variance_between by variance_within. This is the right thing  
to do according to Wikipedia, but of course variance_within can quite  
possibly be 0.
Now from my statistical intution the following results are expected:
v_b != 0 && v_w == 0 -> pval = 1
v_b == 0 && v_w == 0 -> pval = 0

But I am no expert on this... also I don't know what to do if v_w is  
really close to 0 but not 0.

Does anyone know more?


> octave:5> version
> ans = 3.6.1

I updated to 3.6.1 as well, no change.

Best,
Hannes

_______________________________________________
Help-octave mailing list
[hidden email]
https://mailman.cae.wisc.edu/listinfo/help-octave
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Problems with Statistics package

Hannes
Ok, I think I got it pinned downed and solved. I have attached a patch.

Heres the explenation:

>   total_mean = mean (y(:));
>   SSB = sum (group_count .* (group_mean - total_mean) .^ 2);

is always positive [sum of squares between]

>   SST = sumsq (reshape (y, n, 1) - total_mean);

is always positive and >= SSB [sum of squares total]

>   SSW = SST - SSB;

=> should always be >=0
but it is an ill conditioned operation if SST==SSB resulting in  
possibly negative error. Note that if the error is positive, it  
doesn't matter because it will produce f == Inf and p = 0 as expected.

>   df_b = k - 1;
>   df_w = n - k;
>   v_b = SSB / df_b;
>   v_w = SSW / df_w;

this can become negative, if the SSW is negative or 0 if SSW is zero  
or close to zero.


>   f = v_b / v_w;

this results in divide by zero if v_w is zero and results in *wrong  
result without warning* if v_w is negative.


My patch changes this whole block to:

   SSB = sum (group_count .* (group_mean - total_mean) .^ 2);
   SST = sumsq (reshape (y, n, 1) - total_mean);
   SSW = SST - SSB;
   if (SSW < 0) # machine error if SSB == SSW
     SSW = 0;
   endif
   df_b = k - 1;
   df_w = n - k;
   v_b = SSB / df_b;
   v_w = SSW / df_w;
   if (v_w != 0)
     f = v_b / v_w;
     pval = 1 - fcdf (f, df_b, df_w);
   elseif (v_b == 0)
     f = NaN;
     pval = 1;

0 / 0 is  NaN and the hypothesis mus be rejected

   else
     f = Inf;
     pval = 0;

x / 0 is Inf and hypothesis must be accepted (because there is no  
variance inside the sample but there is outside

   endif

HTH,
Hannes
_______________________________________________
Help-octave mailing list
[hidden email]
https://mailman.cae.wisc.edu/listinfo/help-octave

patch_anova.m (769 bytes) Download Attachment
Loading...