

I've just rerun the sciview tests, since I'd presumed that all of the recent
changes in Octave would result in significantly faster code. In fact, it is
just the last test that is much slower. I think the [] operator is
slower in 2.1.54, though I need to check some more. Any clues?
See the attached log files.
Regards
David

David Bateman [hidden email]
Motorola CRM +33 1 69 35 48 04 (Ph)
Parc Les Algorithmes, Commune de St Aubin +33 1 69 35 77 01 (Fax)
91193 Gif-sur-Yvette FRANCE
The information contained in this communication has been classified as:
[x] General Business Information
[ ] Motorola Internal Use Only
[ ] Motorola Confidential Proprietary


On 17-Feb-2004, David Bateman < [hidden email]> wrote:
 I've just rerun the sciview tests, since I'd presumed that all of the recent
 changes in Octave would result in significantly faster code. In fact, it is
 just the last test that is much slower. I think the [] operator is
 slower in 2.1.54, though I need to check some more. Any clues?
The [] operator for creating arrays was modified to work with N-d
numeric arrays, so it could be somewhat slower. It might be that the
slow part is the way indexing and bounds checking works for inserting
one N-d array into another, but that's just a guess.
jwe


David Bateman wrote:
> I've just rerun the sciview tests, since I'd presumed that all of the recent
> changes in Octave would result in significantly faster code. In fact, it is
> just the last test that is much slower. I think the [] operator is
> slower in 2.1.54, though I need to check some more. Any clues?
I also noticed that recursion is slower. For the sciview tests it shows
up in the grand common divisors test (gcd2.m).
2.1.50:
750,000 Fibonacci numbers calculation (vector calc)_ (sec): 0.3791
Creation of a 2250x2250 Hilbert matrix (matrix calc) (sec): 0.8338
Grand common divisors of 70,000 pairs (recursion)___ (sec): 0.447
Creation of a 220x220 Toeplitz matrix (loops)_______ (sec): 1.031
Escoufier's method on a 37x37 matrix (mixed)________ (sec): 3.515
2.1.54:
750,000 Fibonacci numbers calculation (vector calc)_ (sec): 0.3799
Creation of a 2250x2250 Hilbert matrix (matrix calc) (sec): 0.8019
Grand common divisors of 70,000 pairs (recursion)___ (sec): 0.5951
Creation of a 220x220 Toeplitz matrix (loops)_______ (sec): 0.9687
Escoufier's method on a 37x37 matrix (mixed)________ (sec): 4.484
(As I reported earlier, I also see a slowdown in sylvester_matrix(),
which uses recursion as well.)
This change seems to occur in the 2.1.53 -> 2.1.54 transition; I just do
not have saved data to show it.
> Regards
> David
>
>
Dmitri.


On 17Feb2004, Dmitri A. Sergatskov < [hidden email]> wrote:
 David Bateman wrote:
 > I've just rerun the sciview tests, since I'd presumed that all of the recent
 > changes in Octave would result in significantly faster code. In fact, it is
 > just the last test that is much slower. I think the [] operator is
 > slower in 2.1.54, though I need to check some more. Any clues?

 I also noticed that recursion is slower. For sciview tests it is in
 grand common divisors tests (gcd2.m).
There were some changes to the parse tree code to store more data to
allow for better information from warning messages in the future.
Saving and restoring the data requires some additional calls to
functions in the unwind_protect class. It's possible that this is the
cause of what you are seeing.
But recursion and function calls are known to be slow in Octave, and
there is no simple fix. If you want them to be faster, then I think
you will need to write a new interpreter that takes the current
parse tree information and converts it to something that can be used
by a stack-based interpreter or a just-in-time compiler.
jwe


According to Dmitri A. Sergatskov < [hidden email]> (on 02/17/04):
>
> I also noticed that recursion is slower. For sciview tests it is in
> grand common divisors tests (gcd2.m).
>
> 2.1.50:
<snip>
> Grand common divisors of 70,000 pairs (recursion)___ (sec): 0.447
<snip>
> Escoufier's method on a 37x37 matrix (mixed)________ (sec): 3.515
>
> 2.1.54:
<snip>
> Grand common divisors of 70,000 pairs (recursion)___ (sec): 0.5951
<snip>
> Escoufier's method on a 37x37 matrix (mixed)________ (sec): 4.484
The gcd stuff might just be a timing error. The code used for the timing
in the sciview tests is pretty rough, using tic/toc. I've replaced this
in my version with cputime instead, and get the following:
2.1.50:
Grand common divisors of 70,000 pairs (recursion)___ (sec): 0.6133
Escoufier's method on a 37x37 matrix (mixed)________ (sec): 2.96
2.1.54:
Grand common divisors of 70,000 pairs (recursion)___ (sec): 0.6033
Escoufier's method on a 37x37 matrix (mixed)________ (sec): 5.09
So the only significant slowdown I see is in the last test. Here is another
interesting set of tests:
2.1.50
tic; x = []; for i=1:1e3; x = [x, i]; endfor; toc
ans = 0.079843
tic; x=0; for i=1:1e3; x++; endfor; toc
ans = 0.018920
2.1.54
tic; x = []; for i=1:1e3; x = [x, i]; endfor; toc
ans = 0.87523
tic; x=0; for i=1:1e3; x++; endfor; toc
ans = 0.014731
This shows that the loops are roughly the same speed in 2.1.50
and 2.1.54, and maybe even slightly faster in 2.1.54. However, the
concatenation is about 10 times slower! This is exactly the same
type of code that is used in the last of the sciview tests, so this is
bad. It looks like the N-d concatenation support needs some optimizing.
Regards
David



David Bateman wrote:
> The gcd stuff might just be a timing error. The code used for the timing
> in the sciview tests is pretty rough, using tic/toc. I've replaced this
> in my version with cputime instead, and get the following:
cputime does not work with pthreads, so I have to use tic/toc.
Here is another example of recursion slowdown:
2.1.53:
tic; for n=1:1000; bm_x=sylvester_matrix(7) ; endfor ; toc
ans = 4.2025
2.1.54:
tic; for n=1:1000; bm_x=sylvester_matrix(7) ; endfor ; toc
ans = 11.574
(Those are both stock releases, no patches for SMP/pthreads etc.;
--enable-shared --disable-static; -O3 -march=athlonmp; Athlon MP x2.)
> So the only significant slowdown I see is in the last test. Here is another
> interesting set of tests:
>
> 2.1.50
> tic; x = []; for i=1:1e3; x = [x, i]; endfor; toc
> ans = 0.079843
> tic; x=0; for i=1:1e3; x++; endfor; toc
> ans = 0.018920
I do not trust tic/toc numbers less than 0.1 sec, so I increased the index:
2.1.53:
tic; x = []; for i=1:1e4; x = [x, i]; endfor; toc
ans = 3.2951
2.1.54:
tic; x = []; for i=1:1e4; x = [x, i]; endfor; toc
ans = 20.234
2.1.53:
tic; x=0; for i=1:1e4; x++; endfor; toc
ans = 0.024073
tic; x=0; for i=1:1e5; x++; endfor; toc
ans = 0.24130
tic; x=0; for i=1:1e6; x++; endfor; toc
ans = 2.3381
2.1.54:
tic; x=0; for i=1:1e4; x++; endfor; toc
ans = 0.028044
tic; x=0; for i=1:1e5; x++; endfor; toc
ans = 0.26263
tic; x=0; for i=1:1e6; x++; endfor; toc
ans = 2.6506
(I do not understand it, but my old records show that loops
were slower; it seems that something else in my system changed that
sped them up.)
>
> Regards
> David
>
Sincerely,
Dmitri.


According to Dmitri A. Sergatskov < [hidden email]> (on 02/17/04):
>
> cputime does not work with pthreads, so I have to use tic/toc.
> Here is another example of recursion slowdown:
>
> 2.1.53:
>
> tic; for n=1:1000; bm_x=sylvester_matrix(7) ; endfor ; toc
> ans = 4.2025
>
> 2.1.54:
>
> tic; for n=1:1000; bm_x=sylvester_matrix(7) ; endfor ; toc
> ans = 11.574
OK, I confirm your results:
2.1.50:
tic; for n=1:1000; bm_x=sylvester_matrix(7) ; endfor ; toc
ans = 6.1742
2.1.54:
tic; for n=1:1000; bm_x=sylvester_matrix(7) ; endfor ; toc
ans = 40.233
However, what is it in the sylvester_matrix function that makes you think the
problem is due to recursion? There is a line in sylvester_matrix:
retval = [tmp, tmp; tmp, tmp];
I tried replacing this with
retval = 1;
just to check whether the problem was due to recursion or to this
concatenation operation. The results were:
2.1.50:
tic; for n=1:1000; bm_x=sylvester_matrix(7) ; endfor ; toc
ans = 2.9223
2.1.54
tic; for n=1:1000; bm_x=sylvester_matrix(7) ; endfor ; toc
ans = 2.6302
So to me the problem is clearly in the concatenation operations, not
the recursion. Interestingly, the above allows an estimate of the slowdown
in the concatenation operation to be derived as
(40.233 - 2.6302) / (6.1742 - 2.9223) = 11.563
So there is a factor of roughly 11 slowdown in the concatenation operations.
Regards
D.



On 18-Feb-2004, David Bateman < [hidden email]> wrote:
 So to me the problem is clearly in the concatenation operations, not
 the recursion. Interestingly, the above allows an estimate of the slowdown
 in the concatenation operation to be derived as

 (40.233 - 2.6302) / (6.1742 - 2.9223) = 11.563

 So there is a factor of roughly 11 slowdown in the concatenation operations.
I think the following change should help.
Thanks,
jwe
liboctave/ChangeLog:
2004-02-18  John W. Eaton  < [hidden email]>
* Array.cc (Array<T>::insertN (const Array<T>&, int, int)):
Rename from Array<T>::insert.
(Array<T>::insert2 (const Array<T>&, int, int)):
Reinstate old Array<T>::insert function under this name.
(Array<T>::insert (const Array<T>&, int, int)):
New function. Dispatch to insert2 or insertN as appropriate.
Index: liboctave/Array.cc
===================================================================
RCS file: /usr/local/cvsroot/octave/liboctave/Array.cc,v
retrieving revision 1.104
diff -u -r1.104 Array.cc
--- liboctave/Array.cc  16 Feb 2004 05:56:50 -0000  1.104
+++ liboctave/Array.cc  18 Feb 2004 12:50:17 -0000
@@ -931,6 +931,39 @@
 Array<T>&
 Array<T>::insert (const Array<T>& a, int r, int c)
 {
+  if (ndims () == 2 && a.ndims () == 2)
+    insert2 (a, r, c);
+  else
+    insertN (a, r, c);
+
+  return *this;
+}
+
+template <class T>
+Array<T>&
+Array<T>::insert2 (const Array<T>& a, int r, int c)
+{
+  int a_rows = a.rows ();
+  int a_cols = a.cols ();
+
+  if (r < 0 || r + a_rows > rows () || c < 0 || c + a_cols > cols ())
+    {
+      (*current_liboctave_error_handler) ("range error for insert");
+      return *this;
+    }
+
+  for (int j = 0; j < a_cols; j++)
+    for (int i = 0; i < a_rows; i++)
+      elem (r+i, c+j) = a.elem (i, j);
+
+  return *this;
+}
+
+template <class T>
+Array<T>&
+Array<T>::insertN (const Array<T>& a, int r, int c)
+{
   dim_vector a_dv = a.dims ();
   int n = a_dv.length ();
Index: liboctave/Array.h
===================================================================
RCS file: /usr/local/cvsroot/octave/liboctave/Array.h,v
retrieving revision 1.84
diff -u -r1.84 Array.h
--- liboctave/Array.h  16 Feb 2004 05:07:23 -0000  1.84
+++ liboctave/Array.h  18 Feb 2004 12:50:17 -0000
@@ -446,8 +446,10 @@
     { resize_and_fill (dv, val); }
 
   Array<T>& insert (const Array<T>& a, int r, int c);
+  Array<T>& insert2 (const Array<T>& a, int r, int c);
+  Array<T>& insertN (const Array<T>& a, int r, int c);
-  Array<T>& insert (const Array<T>& a, const Array<int>& dv);
+  Array<T>& insert (const Array<T>& a, const Array<int>& idx);
 
   bool is_square (void) const { return (dim1 () == dim2 ()); }


According to John W. Eaton < [hidden email]> (on 02/18/04):
> On 18-Feb-2004, David Bateman < [hidden email]> wrote:
>
>  So to me the problem is clearly in the concatenation operations, not
>  the recursion. Interestingly, the above allows an estimate of the slowdown
>  in the concatenation operation to be derived as
>
>  (40.233 - 2.6302) / (6.1742 - 2.9223) = 11.563
>
>  So there is a factor of roughly 11 slowdown in the concatenation operations.
>
> I think the following change should help.
>
2.1.50:
tic; for n=1:1000; bm_x=sylvester_matrix(7) ; endfor ; toc
ans = 6.1763
2.1.54+patch:
tic; for n=1:1000; bm_x=sylvester_matrix(7) ; endfor ; toc
ans = 6.6461
Now I'm happy :)
Cheers
David



David and Dmitri,
Whilst it is a bit at right angles to the discussion, because the fault
with the [] operator was fixed, should we really worry about constructs
like
j = 1e4; tic; x = []; for i=1:j; x = [x, i]; endfor; toc
that have square-law scaling with the length of i?
j = 1e4; tic; x = zeros(1, j); for i = 1:j; x(i) = i; endfor; toc
is 30 times faster, at j=1e4, and is linear in j.


On 18-Feb-2004, Paul Thomas < [hidden email]> wrote:
 Whilst it is a bit at right angles to the discussion because the fault
 with the [] operator was fixed, should we really worry about constructs
 like

 j = 1e4; tic; x = []; for i=1:j; x = [x, i]; endfor; toc

 that have square law scaling with the length of i?

 j = 1e4; tic; x = zeros(1, j); for i = 1:j; x(i) = i; endfor; toc

 is 30 times faster, at j=1e4, and is linear in j.
It's true that you shouldn't be using the first form if you can avoid
it, but the fact that it was so much slower going from 2.1.53 to
2.1.54 showed that the run time was not completely dominated by
reallocating memory. So I think it means we would also have seen a
big drop in performance for the [] operator with a large number of
elements.
jwe


Yes, that's why I said that the comment was a bit at right angles.
By the way, John, I have been slowly but surely following up those
Mickey Mouse timing tests by following what the parser and then the
tree walker do with them. In essence, I have been trying to understand
why up to a few thousand CPU operations are needed to do a=b+c. I have
come to appreciate what an extraordinary job you have done with the
interpreter. So perhaps it is timely to congratulate you and say,
"Well done", as the 12th birthday of Octave comes up?
One thing that I have been trying to analyse is what proportion of the
time is devoted to doing what kind of task in, say, a simple binary
operation and assignment. Apart from the obvious overheads, such as
array bound checking, it strikes me that leaving open the variable
types, until the last moment, is quite an expensive (if not the
dominant) use of time.
I wonder if the parser can have a stab at identifying the types and,
when it is possible, substitute a specific tree_expression for the
general one? Thus Paul Kienzle's tot=0; x=[1:1e5]; for
i=1:1e5; tot=tot+x(i); end could be made to thunder along, because the
types and the requisite function for the binary_op are identified from
the outset by the initial assignments. Sometimes, as for function
arguments, it would be necessary to work back from the context in which
variables are used later on. This argues, I guess, for a pass between
the lexing and the parsing. I will try to analyse, in the following
weeks, how well such a thing would work for real-life functions and
scripts. Clearly the likes of function [b,a]=myfunc(a,b); endfunction
will beat the system (how about optional typing?), but I do not think
that this need generally be the case.
Best regards
Paul Thomas


On 18-Feb-2004, Paul Thomas < [hidden email]> wrote:
 One thing that I have been trying to analyse is what proportion of the
 time is devoted to doing what kind of task in, say, a simple binary
 operation and assignment. Apart from the obvious overheads, such as
 array bound checking, it strikes me that leaving open the variable
 types, until the last moment, is quite an expensive (if not the
 dominant) use of time.
Perhaps. I have not tried to profile it. Another thing that could
make the interpreter slow is that because it is a simple tree-based
interpreter, the evaluator results in a pointer-chasing problem, which
can be slow (or at least slower than a stack-based interpreter) on
modern processors, because you have to load the data for a given node
in the tree and then look at the pointers it contains before you can
load the next node(s) in the tree. Also, the interpreter code is
spread out all over the place, so it probably doesn't fit (and stay)
in a cache very well. But this is just wild guessing from someone who
doesn't really know that much about the details.
 I wonder if the parser can have a stab at identifying the types and,
 when it is possible, a specific tree_expression substituted for the
 general one?
Type inference will be the hardest part of writing a good Octave
compiler.
jwe


Paul,
Basically, I know you shouldn't write code in the way I used to test
the problem, and so this might not practically have resulted in any
problems (except for sylvester_matrix). My main concern recently has
been to try to improve the benchmarked performance of Octave, and one
benchmark that Octave performed very badly on can be found at
http://www.sciviews.org/other/benchmark.htm
As the author of these benchmarks states, some of these tests have
been written badly to torture-test the interpreter. This is particularly
true of the last two tests, where "tic; x = []; for i=1:1e3; x = [x,
i]; endfor; toc" is a fair summary of one of them. So since this is
one of the metrics for their test, it would have been a pity to see a
factor of 10 slowdown in Octave.
In any case, this benchmark pretty much explains many of my recent
contributions: the stuff I did on sort and randn recently in
octave-forge, also the eigenvalue patch that doesn't calculate
eigenvectors if you don't want them, and the FFT stuff. With these
changes, and the fact that 2.1.42 didn't properly use LAPACK/ATLAS,
whereas Octave does now, I estimate the final total in this benchmark
for the CVS version of Octave to have come down from 27.76 to about 15
or 16. Whether this reflects the speedup a real user will see is
more questionable, however...
Regards
David



On Feb 18, 2004, at 7:14 PM, David Bateman wrote:
> In any case, this benchmark pretty much explains many of my recent
> contributions: the stuff I did on sort and randn recently in
> octave-forge, also the eigenvalue patch that doesn't calculate
> eigenvectors if you don't want them, and the FFT stuff. With these
> changes, and the fact that 2.1.42 didn't properly use LAPACK/ATLAS,
> whereas Octave does now, I estimate the final total in this benchmark
> for the CVS version of Octave to have come down from 27.76 to about 15
> or 16. Whether this reflects the speedup a real user will see is
> more questionable, however...
I certainly appreciate your work. sort in particular is used
in many places it shouldn't be, because it can avoid a
scripted for-loop. O(n log n) is faster than O(n) even on
large problems if the constants are large enough.
So there is a direct benefit for things like hist and interp1.
Similarly, fft is used all over the place, so speedups here
affect much more than just the fft.
Fast random number generators are important in many
contexts, but maybe not so much in a scripted environment.
With a little massaging, though, we can have something
state of the art that is useful outside of Octave.
Again, thanks!
Paul Kienzle
[hidden email]


Paul Kienzle wrote:
> Similarly, fft is used all over the place, so speedups here
> affect much more than just the fft.
>
> Fast random number generators are important in many
> contexts, but maybe not so much in a scripted environment.
> With a little massaging, though, we can have something
> state of the art useful outside of octave.
I would argue that fast (and good) random number generation is just as
important with Octave. Simulations of unlikely events (say
communications scenarios) can require a number of trials that translates
to hours of processor time. There are some speedup methods for linear
systems, but nonlinear systems are another story.
Dan


David,
Yes, I am very impressed by your efforts too. The FFT, SORT and RAND
upgrades that you all have put together are superb.
The point that I was making, in effect, was that the benchmarks should
be "optimised" in each of the languages that are being compared,
otherwise they are meaningless. For example, some languages might use
the Standard Library strategy of allocating more space than is called
for when constructing a new container. Octave does this sometimes; e.g.
j = 1e4; tic; x = []; for i = 1:j; x(i) = i; endfor; toc changes its spots
altogether when j>1e4 (I haven't checked in the source, but I presume
that a reserve of about 8e4 bytes is being allocated?).
Using the [] operator to build up arrays is bad news in Matlab and
Octave, for reasons that you know and understand. I have no idea what
hidden wrinkles there are in the other languages in the benchmark. That
was the first thing that occurred to me when I first saw the benchmarks.
Regards
Paul


On 19-Feb-2004, Paul Thomas < [hidden email]> wrote:
 Octave does this sometimes; e.g. j =
 1e4; tic; x = []; for i = 1:j; x(i) = i; endfor; toc changes its spots
 altogether when j>1e4 (I haven't checked in the source, but I presume
 that a reserve of about 8e4 bytes is being allocated?).
No, there is no reserve allocation. I think in this case, performance
degrades rapidly because you are repeatedly allocating and copying an
array that includes just one more element each time. Doing that 10000
times is bad, and it may also be that none of the previous allocations
can be reused, so that it ends up requiring a lot more than 80000
bytes in the end. (The released chunks of memory are not the right
size unless the allocator is able to combine them into something
that is useful, and there is no guarantee that it can.)
jwe


John,
I think that you are right about this being an allocator effect. Up to
precisely 8192 words, the growth in the time taken by the little code is
linear. After that, it goes over abruptly to a quadratic time dependence.
The former indicates that the system is able to reuse the previous
allocation; then, beyond 64 kbytes, x has to be moved, which gives the
quadratic dependence.
Paul

