Re: splinefit test failures


Re: splinefit test failures

Rik-4
On 08/01/2012 09:59 AM, [hidden email] wrote:

> Message: 7
> Date: Wed, 01 Aug 2012 11:59:02 -0500
> From: Daniel J Sebald <[hidden email]>
> To: "John W. Eaton" <[hidden email]>
> Cc: octave maintainers mailing list <[hidden email]>
> Subject: Re: random numbers in tests
> Message-ID: <[hidden email]>
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>
> On 08/01/2012 11:39 AM, John W. Eaton wrote:
>>
>> >
>> > Since all I had done was rename some files, I couldn't understand what
>> > could have caused the problem.  After determining that the changeset
>> > that renamed the files was definitely the one that resulted in the
>> > failed tests, and noting that running the tests from the command line
>> > worked, I was really puzzled.  Only after all of that did I finally
>> > notice that the tests use random data.
>> >
>> > It seems the reason the change reliably affected "make check" was that
>> > by renaming the DLD-FUNCTION directory to dldfcn, the tests were run
>> > in a different order.  Previously, the tests from files in the
>> > DLD-FUNCTION directory were executed first.  Now they were done later,
>> > after many other tests, some of which have random values, and some
>> > that may set the random number generator state.
>> >
>> > Is this sort of thing also what caused the recent problem with the
>> > svds test failure?
> It sure looks like it.  Some of the examples I gave yesterday showed
> that the SVD on sparse data algorithm had results varying by at least four
> times eps(), and that was just one or two examples.  If one were to look
> at hundreds or thousands of examples, I would think it is very likely to
> exceed 10*eps.
>
> Spline fits and simulations can have less accuracy as well.  So the
> 10*eps tolerance is a bigger question.
>
>
>> > Should we always set the random number generator state for tests so
>> > that they can be reproducible?  If so, should this be done
>> > automatically by the testing functions, or left to each individual
>> > test?
> I would say that putting in a fixed input that passes is not the thing
> to do.  The problem with that approach is that if the library changes its
> algorithm slightly, these same issues might pop up again when the library
> is updated, and people will wonder what is wrong once again.
I also think we shouldn't "fix" the random data by initializing the seed in
test.m.  For complete testing one needs both directed tests, created by
programmers, and random tests to cover the cases that no human would think
of, but which are legal.  I think the current code re-organization is a
great chance to expose latent bugs.
>
> Instead, I think the sort of approach that Ed suggested yesterday is
> the thing to do.  I.e., come up with a reasonable estimate for how
> accurate such an algorithm should be and use that.  Octave is testing
> functionality here, not the ultimate accuracy of the algorithm, correct?
Actually we are interested in both things.  Users rely on an Octave
algorithm to do what it says (functionality) and to do it accurately
(tolerance).  For example, the square root function could use many
different algorithms.  One simple replacement for the sqrt() mapper
function on to the C library (the current Octave solution) would be to use
a root finding routine like fzero.  So, hypothetically,

function y = sqrt_rep (x)
  y = fzero (@(z) z*z -x, 0);
endfunction

If I try "sqrt_rep (5)" I get "-2.2361".  Excepting the sign of the result,
the answer is accurate to the 5 digits displayed.  However, if I try abs
(ans) - sqrt (5) I get 1.4e-8 so the ultimate accuracy of this algorithm
isn't very good although the algorithm is functional.
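
For instance, the check above can be scripted like this (a rough sketch;
the exact numbers will vary from run to run):

x = sqrt_rep (5);           # displays as -2.2361
err = abs (x) - sqrt (5);   # roughly 1.4e-8, as described above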

Also, we do want more than just a *reasonable* estimate of the accuracy.
We try and test close to the bounds of the accuracy of the algorithm
because, even with a good algorithm, there are plenty of ways that the
implementation can be screwed up.  Perhaps we cast intermediate results to
float and thereby throw away accuracy.  What if we have an off-by-1 error
in a loop condition that stops us from doing the final iteration that
drives the accuracy below eps?  Having tight tolerances helps us understand
whether it is the algorithm or the programmer which is failing.  If it can
be determined with certainty that it is the algorithm, rather than the
implementation, which is underperforming then I think it is acceptable at
that point to raise tolerances to stop %!test failures.
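
As a concrete illustration (a made-up check, not one taken from the Octave
sources), a tight-tolerance test written as a %!test block looks like this:

%!test
%! x = [1, 4, 9, 16, 25];
%! assert (sqrt (x), [1, 2, 3, 4, 5], 2*eps);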
>
> I tried running some of the examples to see how accurate the spline fit
> is, but kept getting errors about some pp structure not having 'n' as a
> member.
That is really odd.  You might try 'which splinefit' to make sure you are
pulling the correct one from the scripts/polynomial directory.  I ran 'test
splinefit' and it works fine on revision dda73cb60ac5.

A good way to test whether there are repeatability issues is not to run
'make check', which takes too long, but to run 'test ("suspect_fcn")'.  I
ran the following, after commenting out the line that initializes the randn
seed to 13.

fid = fopen ("tst_spline.err", "w");
for i = 1:1000
  bm(i) = test ("splinefit", "quiet", fid);
endfor
sum (bm)
fclose (fid);

The benchmark sum shows 898, so the tests fail approximately 10% of the
time.  Looking through the results in the tst_spline.err log, I see that a
test fails only when randn has returned a value exceptionally far from the
expected mean of 0.  Given that randn can return any real number in
(-Inf, +Inf), we might be better off testing the function with a narrower
input.

Replacing
%! yb = randn (size (xb));       # range is [-Inf, Inf]
with
%! yb = 2*rand (size (xb)) - 1;  # range is [-1, 1]

changes the success rate to 999/1000.

--Rik

Re: splinefit test failures

Daniel Sebald
On 08/02/2012 11:44 AM, Rik wrote:

> On 08/01/2012 09:59 AM, [hidden email] wrote:
>> Message: 7
>> Date: Wed, 01 Aug 2012 11:59:02 -0500
>> From: Daniel J Sebald<[hidden email]>
>> To: "John W. Eaton"<[hidden email]>
>> Cc: octave maintainers mailing list<[hidden email]>
>> Subject: Re: random numbers in tests
>> Message-ID:<[hidden email]>
>> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>>
>> On 08/01/2012 11:39 AM, John W. Eaton wrote:
>>>
>>>>
>>>> Since all I had done was rename some files, I couldn't understand what
>>>> could have caused the problem.  After determining that the changeset
>>>> that renamed the files was definitely the one that resulted in the
>>>> failed tests, and noting that running the tests from the command line
>>>> worked, I was really puzzled.  Only after all of that did I finally
>>>> notice that the tests use random data.
>>>>
>>>> It seems the reason the change reliably affected "make check" was that
>>>> by renaming the DLD-FUNCTION directory to dldfcn, the tests were run
>>>> in a different order.  Previously, the tests from files in the
>>>> DLD-FUNCTION directory were executed first.  Now they were done later,
>>>> after many other tests, some of which have random values, and some
>>>> that may set the random number generator state.
>>>>
>>>> Is this sort of thing also what caused the recent problem with the
>>>> svds test failure?
>> It sure looks like it.  Some of the examples I gave yesterday showed
>> that the SVD on sparse data algorithm had results varying by at least four
>> times eps(), and that was just one or two examples.  If one were to look
>> at hundreds or thousands of examples, I would think it is very likely to
>> exceed 10*eps.
>>
>> Spline fits and simulations can have less accuracy as well.  So the
>> 10*eps tolerance is a bigger question.
>>
>>
>>>> Should we always set the random number generator state for tests so
>>>> that they can be reproducible?  If so, should this be done
>>>> automatically by the testing functions, or left to each individual
>>>> test?
>> I would say that putting in a fixed input that passes is not the thing
>> to do.  The problem with that approach is that if the library changes its
>> algorithm slightly, these same issues might pop up again when the library
>> is updated, and people will wonder what is wrong once again.
> I also think we shouldn't "fix" the random data by initializing the seed in
> test.m.  For complete testing one needs both directed tests, created by
> programmers, and random tests to cover the cases that no human would think
> of, but which are legal.  I think the current code re-organization is a
> great chance to expose latent bugs.

Agreed.


>> Instead, I think the sort of approach that Ed suggested yesterday is
>> the thing to do.  I.e., come up with a reasonable estimate for how
>> accurate such an algorithm should be and use that.  Octave is testing
>> functionality here, not the ultimate accuracy of the algorithm, correct?
> Actually we are interested in both things.  Users rely on an Octave
> algorithm to do what it says (functionality) and to do it accurately
> (tolerance).  For example, the square root function could use many
> different algorithms.  One simple replacement for the sqrt() mapper
> function on to the C library (the current Octave solution) would be to use
> a root finding routine like fzero.  So, hypothetically,
>
> function y = sqrt_rep (x)
>    y = fzero (@(z) z*z -x, 0);
> endfunction
>
> If I try "sqrt_rep (5)" I get "-2.2361".  Excepting the sign of the result,
> the answer is accurate to the 5 digits displayed.  However, if I try abs
> (ans) - sqrt (5) I get 1.4e-8 so the ultimate accuracy of this algorithm
> isn't very good although the algorithm is functional.

Yes, that is fairly inaccurate in terms of computer precision, but at
the same time it may be adequate for many applications.  1e-8 isn't awful
for some uses.  (I wouldn't use it to compute a square root, but if it
solved some general problem whose answer I knew to be sqrt(5), I might be
satisfied.)  In the past I've tested some of Octave's Runge-Kutta
differential equation solvers and had accuracy on the order of 1e-5 or
so.  I would have liked better.  The point is that one should check the
tools being used and know their limitations.

But you've swayed me on the "trust Octave accuracy" point.


> Also, we do want more than just a *reasonable* estimate of the accuracy.
> We try and test close to the bounds of the accuracy of the algorithm
> because, even with a good algorithm, there are plenty of ways that the
> implementation can be screwed up.  Perhaps we cast intermediate results to
> float and thereby throw away accuracy.  What if we have an off-by-1 error
> in a loop condition that stops us from doing the final iteration that
> drives the accuracy below eps?  Having tight tolerances helps us understand
> whether it is the algorithm or the programmer which is failing.  If it can
> be determined with certainty that it is the algorithm, rather than the
> implementation, which is underperforming then I think it is acceptable at
> that point to raise tolerances to stop %!test failures.

Well, if Octave is using ARPACK, which is outside of Octave's control,
then there isn't much alternative.  Nonetheless, I think that tolerances
of 2*eps are just too small.  100*eps is more like it.  If something is
accurate to 1e-14, that's amazing.  1e-8 is something we can fathom, but
1e-14 is like subatomic in my consciousness.  Check this result:

octave:1> sqrt(5)*sqrt(5) - 5
ans =  8.8818e-16

sqrt() isn't even accurate to 2*eps, so expecting other numerical
techniques to be that good is too tight a tolerance.
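
For reference (just restating the number above), that residual works out
to about four times eps:

(sqrt (5)*sqrt (5) - 5) / eps    # roughly 4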


>> I tried running some of the examples to see how accurate the spline fit
>> is, but kept getting errors about some pp structure not having 'n' as a
>> member.
> That is really odd.  You might try 'which splinefit' to make sure you are
> pulling the correct one from the scripts/polynomial directory.  I ran 'test
> splinefit' and it works fine on revision dda73cb60ac5.
>
> A good way to test whether there are repeatability issues is not to run
> 'make check', which takes too long, but to run 'test ("suspect_fcn")'.  I
> ran the following, after commenting out the line that initializes the randn
> seed to 13.
>
> fid = fopen ("tst_spline.err", "w");
> for i = 1:1000
>    bm(i) = test ("splinefit", "quiet", fid);
> endfor
> sum (bm)
> fclose (fid);
>
> The benchmark sum shows 898, so the tests fail approximately 10% of the
> time.  Looking through the results in the tst_spline.err log, I see that a
> test fails only when randn has returned a value exceptionally far from the
> expected mean of 0.  Given that randn can return any real number in
> (-Inf, +Inf), we might be better off testing the function with a narrower
> input.
>
> Replacing
> %! yb = randn (size (xb));       # range is [-Inf, Inf]
> with
> %! yb = 2*rand (size (xb)) - 1;  # range is [-1, 1]
>
> changes the success rate to 999/1000.

Seems a worthwhile test to me.  Again, I'd loosen the tolerance a bit, to
the point where random deviations outside the tolerance become rare.  And
if that doesn't happen, then something needs to be fixed.

Dan

Re: splinefit test failures

Rik-4
On 08/02/2012 10:45 AM, Daniel J Sebald wrote:

>
>> The benchmark sum shows 898, so the tests fail approximately 10% of the
>> time.  Looking through the results in the tst_spline.err log, I see that a
>> test fails only when randn has returned a value exceptionally far from the
>> expected mean of 0.  Given that randn can return any real number in
>> (-Inf, +Inf), we might be better off testing the function with a narrower
>> input.
>>
>> Replacing
>> %! yb = randn (size (xb));       # range is [-Inf, Inf]
>> with
>> %! yb = 2*rand (size (xb)) - 1;  # range is [-1, 1]
>>
>> changes the success rate to 999/1000.
>
>
> Seems a worthwhile test to me.  Again, I'd loosen the tolerance a bit, to
> the point where random deviations outside the tolerance become rare.  And if
> that doesn't happen, then something needs to be fixed.
>
> Dan
E-mail is not a very clear communication medium.  Is your vote to keep
randn, drop the initialization to a specific seed, and loosen the
tolerance?  Or is it to switch to rand()?

--Rik


Re: splinefit test failures

Daniel Sebald
On 08/02/2012 01:24 PM, Rik wrote:

> On 08/02/2012 10:45 AM, Daniel J Sebald wrote:
>>
>>> The benchmark sum shows 898, so the tests fail approximately 10% of the
>>> time.  Looking through the results in the tst_spline.err log, I see that a
>>> test fails only when randn has returned a value exceptionally far from the
>>> expected mean of 0.  Given that randn can return any real number in
>>> (-Inf, +Inf), we might be better off testing the function with a narrower
>>> input.
>>>
>>> Replacing
>>> %! yb = randn (size (xb));       # range is [-Inf, Inf]
>>> with
>>> %! yb = 2*rand (size (xb)) - 1;  # range is [-1, 1]
>>>
>>> changes the success rate to 999/1000.
>>
>>
>> Seems a worthwhile test to me.  Again, I'd loosen the tolerance a bit, to
>> the point where random deviations outside the tolerance become rare.  And if
>> that doesn't happen, then something needs to be fixed.
>>
>> Dan
> E-mail is not a very clear communication medium.  Is your vote to keep
> randn, drop the initialization to a specific seed, and loosen the
> tolerance?  Or is it to switch to rand()?

Drop the specific seed and use random data.  Try several hundred or
thousand repeated tests to explore the space.

Loosen the tolerance to something on the order of 100*eps.  Maybe that is
too big, but I think a good value is somewhere between 10*eps and
100*eps.  From the results I have seen, 10*eps feels like it is within
random deviation.

As for randn vs. rand, I'm not sure.  rand() will certainly get rid of
large deviations in the input.  But maybe the real issue is that the
tolerance is too small.  I'd start by trying a tolerance of 30*eps; maybe
that will obviate the randn/rand question.
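
One rough way to ground the choice (a sketch only; the data, break
placement, and run count below are illustrative, not taken from the actual
splinefit tests) is to record the worst-case error over many random data
sets and see how many multiples of eps it spans:

nrun = 1000;
maxerr = 0;
for i = 1:nrun
  xb = 0:2:10;
  yb = 2*rand (size (xb)) - 1;            # data in [-1, 1]
  pp = splinefit (xb, yb, xb);            # breaks at the data points
  err = max (abs (ppval (pp, xb) - yb));  # worst deviation at the data
  maxerr = max (maxerr, err);
endfor
printf ("worst error = %g (about %.1f * eps)\n", maxerr, maxerr / eps);

A tolerance sitting a comfortable margin above that worst case would then
be a measured choice rather than a guess.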

Dan

Re: splinefit test failures

eem2314
In reply to this post by Rik-4


On Thu, Aug 2, 2012 at 9:44 AM, Rik <[hidden email]> wrote:
On 08/01/2012 09:59 AM, [hidden email] wrote:
> Message: 7
> Date: Wed, 01 Aug 2012 11:59:02 -0500
> From: Daniel J Sebald <[hidden email]>
> To: "John W. Eaton" <[hidden email]>
> Cc: octave maintainers mailing list <[hidden email]>
> Subject: Re: random numbers in tests
> Message-ID: <[hidden email]>
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>
> On 08/01/2012 11:39 AM, John W. Eaton wrote:
>>
>> >
>> > Since all I had done was rename some files, I couldn't understand what
>> > could have caused the problem.  After determining that the changeset
>> > that renamed the files was definitely the one that resulted in the
>> > failed tests, and noting that running the tests from the command line
>> > worked, I was really puzzled.  Only after all of that did I finally
>> > notice that the tests use random data.
>> >
>> > It seems the reason the change reliably affected "make check" was that
>> > by renaming the DLD-FUNCTION directory to dldfcn, the tests were run
>> > in a different order.  Previously, the tests from files in the
>> > DLD-FUNCTION directory were executed first.  Now they were done later,
>> > after many other tests, some of which have random values, and some
>> > that may set the random number generator state.
>> >
>> > Is this sort of thing also what caused the recent problem with the
>> > svds test failure?
> It sure looks like it.  Some of the examples I gave yesterday showed
> that the SVD on sparse data algorithm had results varying by at least four
> times eps(), and that was just one or two examples.  If one were to look
> at hundreds or thousands of examples, I would think it is very likely to
> exceed 10*eps.
>
> Spline fits and simulations can have less accuracy as well.  So the
> 10*eps tolerance is a bigger question.
>
>
>> > Should we always set the random number generator state for tests so
>> > that they can be reproducible?  If so, should this be done
>> > automatically by the testing functions, or left to each individual
>> > test?
> I would say that putting in a fixed input that passes is not the thing
> to do.  The problem with that approach is that if the library changes its
> algorithm slightly, these same issues might pop up again when the library
> is updated, and people will wonder what is wrong once again.
I also think we shouldn't "fix" the random data by initializing the seed in
test.m.  For complete testing one needs both directed tests, created by
programmers, and random tests to cover the cases that no human would think
of, but which are legal.  I think the current code re-organization is a
great chance to expose latent bugs.
>
> Instead, I think the sort of approach that Ed suggested yesterday is
> the thing to do.  I.e., come up with a reasonable estimate for how
> accurate such an algorithm should be and use that.  Octave is testing
> functionality here, not the ultimate accuracy of the algorithm, correct?
Actually we are interested in both things.  Users rely on an Octave
algorithm to do what it says (functionality) and to do it accurately
(tolerance).  For example, the square root function could use many
different algorithms.  One simple replacement for the sqrt() mapper
function on to the C library (the current Octave solution) would be to use
a root finding routine like fzero.  So, hypothetically,

function y = sqrt_rep (x)
  y = fzero (@(z) z*z -x, 0);
endfunction

If I try "sqrt_rep (5)" I get "-2.2361".  Excepting the sign of the result,
the answer is accurate to the 5 digits displayed.  However, if I try abs
(ans) - sqrt (5) I get 1.4e-8 so the ultimate accuracy of this algorithm
isn't very good although the algorithm is functional.

Also, we do want more than just a *reasonable* estimate of the accuracy.
We try and test close to the bounds of the accuracy of the algorithm
because, even with a good algorithm, there are plenty of ways that the
implementation can be screwed up.  Perhaps we cast intermediate results to
float and thereby throw away accuracy.  What if we have an off-by-1 error
in a loop condition that stops us from doing the final iteration that
drives the accuracy below eps?  Having tight tolerances helps us understand
whether it is the algorithm or the programmer which is failing.  If it can
be determined with certainty that it is the algorithm, rather than the
implementation, which is underperforming then I think it is acceptable at
that point to raise tolerances to stop %!test failures.

What I meant was that error bounds must take account of the size of the numbers
in the data; for the splinefit problem that simply means using something like

   10 * eps() * max(norm(y), 1.0)

as a tolerance instead of

   10 * eps()

Doing this, I get zero failures out of 300 runs, instead of 82 failures
with the absolute tolerance.
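
In Octave terms the idea is simply to scale the comparison (a minimal
sketch; y and yhat stand for the reference and computed values here and are
illustrative, not taken from the splinefit tests):

y    = randn (100, 1);                     # reference data (illustrative)
yhat = y + 4*eps () * norm (y);            # computed result with a small error
tol  = 10 * eps () * max (norm (y), 1.0);  # size-aware bound, as above
assert (yhat, y, tol);                     # passes; a bare 10*eps() would not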


--
Ed Meyer