Slow performance on Linux

Slow performance on Linux

Evan Thomas-2
I have recently installed octave-1.1.1 on a PC-pentium/133 and a Sun
SPARC station 2, both with 32Mb memory. Published benchmark figures and
my own experience would suggest that the PC should be significantly
faster than the Sun. However the opposite is true. In fact, for the
calculations I've been doing, the Linux machine is unusable (almost as
if it wasn't using the floating point unit(?)).

Is this what I should expect or have I configured/done something wrong?

The system is Linux kernel 1.3.72 on top of a Debian 0.93R6
distribution. I've tried both the binary and source distributions of
octave.

Thanks, Evan.

--
Evan Thomas
Department of Anatomy & Cell Biology
University of Melbourne
Parkville, 3052
ph: 9344-5849  fax: 9344-5818

Re: Slow performance on Linux

John L Daschbach
>>>>> "Evan" == Evan Thomas <[hidden email]> writes:

    Evan> I have recently installed octave-1.1.1 on a PC-pentium/133
    Evan> and a Sun SPARC station 2, both with 32Mb memory. Published
    Evan> benchmark figures and my own experience would suggest that
    Evan> the PC should be significantly faster than the Sun. However
    Evan> the opposite is true. In fact, for the calculations I've
    Evan> been doing, the Linux machine is unusable (almost as if it
    Evan> wasn't using the floating point unit(?)).

    Evan> Is this what I should expect or have I configured/done
    Evan> something wrong?

    Evan> The system is Linux kernel 1.3.72 on top of a Debian 0.93R6
    Evan> distribution. I've tried both the binary and source
    Evan> distributions of octave.

I think clearly you have done something wrong.  I have not tried it
with Linux, but it runs very well on my 486/33 16M with OS/2 and my
P120 16M with OS/2.  

It runs a lot faster on a RS6000/590 256M :-)

Actually, seat-of-the-pants benchmarks on the RS6000 seem pretty
good.  I have not checked, but the matrix inversion times seem to be
off from Fortran by a factor of 2 or a little more.

It might be useful to compile a set of Octave benchmarks, to compare
both machines and compilers.

I compiled this some time ago, but I think it's gcc 2.7.2 and the IBM
f77 compiler. For my work, the times are fine.  I'll try some on a PC
with OS/2 and send them along.


[d3h486]:(17)>tic;foo=rand(128,128);toc
ans = 0.060529
[d3h486]:(18)>tic;bar=inv(foo);toc      
ans = 0.083243
[d3h486]:(19)>tic;foo=rand(256,256);toc
ans = 0.22873
[d3h486]:(20)>tic;bar=inv(foo);toc      
ans = 0.62577
[d3h486]:(21)>tic;foo=rand(512,512);toc
ans = 0.89869
[d3h486]:(22)>tic;bar=inv(foo);toc      
ans = 4.8428
[d3h486]:(23)>tic;foo=rand(1024,1024);toc
ans = 3.5915
[d3h486]:(24)>tic;bar=inv(foo);toc        
ans = 36.425
[d3h486]:(31)>tic;foo=rand(2048,2048);toc
ans = 14.634
[d3h486]:(32)>tic;bar=inv(foo);toc        
ans = 285.83


-John

Re: Slow performance on Linux

Evan Thomas-2
John L Daschbach wrote:

> [original message cut]
>
> I think clearly you have done something wrong.  I have not tried it
> with Linux, but it runs very well on my 486/33 16M with OS/2 and my
> P120 16M with OS/2.
>
> [rest of message and RS6000/590 timings cut]

Here are some numbers from my machines:

  Test                 SPARC Station 2   Pentium133   Ratio(Sun/P133)
                          (seconds)       (seconds)
  ----                 ---------------   ----------   ---------------
  f=rand(128,128)           0.53647         0.19737         2.7
  b=inv(f)                  1.7881          0.40758         4.4
  f=rand(512,512)           6.2866          0.84654         7.4
  b=inv(f)                128.00           25.471           5.0
  f=rand(1024,1024)        25.591           3.3639          7.6
  b=inv(f)               1011.3           232.49            4.3

  bm (defined below)       52.652         853.25         ** 0.062 **

The bm.m test basically solves 3 simple differential equations using the
lsode function. These tests were performed using octave compiled with
gcc 2.6.1 and f2c on Linux and the binary distribution on the Sun. These
are wall clock times (i.e. using tic and toc), but both machines are very
lightly loaded, so they should closely reflect CPU times.

Can anyone suggest why lsode should be so much slower on Linux/PC?

------------ bm.m test file --------------------------------
#
# Simple ODE benchmark test for Octave
#

#
# Set up global parms, load scripts, etc for sEPSP
#

# These control the driver
global freq dur height width;
freq = 2;
dur = 1;
height = 1;
width = 0.1;

# These control the internal cascade
global alpha beta;
alpha = [1, 1, 1];
beta  = [0.5, 0.5, 0.5];

# These are the normal sEPSP boundary/range conditions
global s0 t;
s0 = [0, 0, 0, 0];
t = (0:0.1:20)';    # <<<=== change the step size to vary run time
#
# Octave script for second messenger driver.
#
# The function produces a series of square pulses
# parameterized by the values shown below.

function ret = driver(t)

  if( !is_vector(t) )
    ret = driver_single(t);
    return;
  endif

  for i = 1:length(t)
    ret (i) = driver_single(t (i));
    i++;
  endfor

  return;

endfunction

#
# The equations "describing" the second messenger cascade
#
function sdot = sEPSP(s, t)

  global alpha beta;

  sdot(1) = alpha(1)*driver(t) - beta(1)*s(1);
  sdot(2) = alpha(2)*s(1) - beta(2)*s(2);
  sdot(3) = alpha(3)*s(2) - beta(3)*s(3);

endfunction
#
# Solve equations
#

disp("Solving equations...");fflush(stdout);
epsp = lsode("sEPSP", s0, t);
---------------------- end bm.m ------------------------------------

Re: Slow performance on Linux

Evan Thomas-2
Francesco Potorti` wrote:

>
> You forgot to post the definition of the driver_single function.  I
> append your code below:
>
> [bm.m as posted above, cut]

Sorry...
---------  driver_single  -------------
# The function produces a series of square pulses
# parameterized by the values shown below.
 
 
function ret = driver_single(t)
 
  global freq dur height width;
 
  ret = 0;
 
  if( t > dur+width )
    return;
  endif
 
  b = floor(t*freq)/freq;
  if( b <= t && t < b+width )
    ret = height;
  endif
 
  return;
 
endfunction
-----------------------------------------------------

Slow performance on Linux

Francesco Potorti`-9
I put together a bm.m file, modified from that posted by Evan Thomas.

I added 95% confidence interval estimates on the benchmark results.  A
first test is done to get rid of first-time slowness (paging in,
loading and compiling functions), then 10 tests are run, the fastest
and slowest are removed, and the 95% confidence interval (assuming a
Gaussian distribution) is computed.  If it exceeds +/-5% of the
mean, more tests are run, up to a maximum of 1000, and then the results
are printed.  If the confidence interval printed exceeds +/-5%, it means
that the times are very variable, probably because of other processes
running together with Octave.
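
The trimming and interval computation described above could be sketched
roughly as follows.  This is an illustration only: the helper name
trimmed_stats and the sample timings are made up, and it assumes a
current Octave.  Note that bm.m reports 200*std/mean, which is about two
standard deviations of a single run; an interval for the mean itself
would be narrower by sqrt(n), shown here as a second output for
comparison.

```octave
1;  # this is a script file, not a function file

# Hypothetical helper: drop every occurrence of the fastest and slowest
# time, then return the mean and two relative half-widths in percent.
function [m, spread_pct, ci_pct] = trimmed_stats (res)
  keep = (res != max (res)) & (res != min (res));
  pres = res(keep);
  m = mean (pres);
  s = std (pres);
  spread_pct = 200 * s / m;                       # ~2 sigma of one run (as in bm.m)
  ci_pct = 200 * s / (m * sqrt (numel (pres)));   # ~2 sigma of the mean
endfunction

# Made-up timings in seconds, with one slow and one fast outlier.
res = [1.00, 1.02, 0.98, 1.01, 0.99, 1.50, 0.90, 1.00, 1.03, 0.97];
[m, spread_pct, ci_pct] = trimmed_stats (res);
printf ("mean %g s, spread +/- %.1f%%, interval of the mean +/- %.1f%%\n",
        m, spread_pct, ci_pct);
```

If all runs take exactly the same time, the mask removes every element
and the mean is undefined, so a real script would need to guard against
that case.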

I prefer inversion of a known matrix rather than a random one to get
more consistent benchmark results.

I also prefer cputime() to tic() and toc(), because the first one is
less dependent on the overall load of the machine.

I set up a framework for easily adding new tests.  Currently, three of
them are done: two Hadamard matrix inversions and an lsode call.  It
would be nice to replace the sEPSP function with something simpler,
and add some more tests.  Adding a test is easily done by modifying the
test(f) function and changing the nf variable.

It would also be nice to grow this into a complete benchmark for Octave,
even nicer if it were made compatible with Matlab.

Here are the results I obtained on an alpha machine.  I am sending two
sets of results, which mainly differ in their accuracy: the machine
was heavily loaded when I ran the first test, and I suppose it was not
when I ran the second one.  I think swapping and task switching make
for the difference.

octave> bm
test #1 0.11882 +/- 15.2% (1000 runs)
test #2 0.585831 +/- 16.2% (1000 runs)
test #3 8.59008 +/- 11.2% (1000 runs)

octave> bm
test #1 0.0997472 +/- 3.2% (10 runs)
test #2 0.46482 +/- 1.2% (10 runs)
test #3 8.19328 +/- 0.5% (10 runs)


------------------------ bm.m ----------------------
#
# Set up global parms, load scripts, etc for sEPSP
#

# These control the driver
global freq dur height width;
freq = 2;
dur = 1;
height = 1;
width = 0.1;

# These control the internal cascade
global alpha beta;
alpha = [1, 1, 1];
beta  = [0.5, 0.5, 0.5];

# These are the normal sEPSP boundary/range conditions
global s0 t;
s0 = [0, 0, 0, 0];
t = (0:0.1:20)';    # <<<=== change the step size to vary run time
#
# Octave script for second messenger driver.
#
# The function produces a series of square pulses
# parameterized by the values shown below.

function ret = driver(t)

  if( !is_vector(t) )
    ret = driver_single(t);
    return;
  endif

  for i = 1:length(t)
    ret (i) = driver_single(t (i));
    i++;
  endfor

  return;

endfunction

#
# The equations "describing" the second messenger cascade
#
function sdot = sEPSP(s, t)

  global alpha beta;

  sdot(1) = alpha(1)*driver(t) - beta(1)*s(1);
  sdot(2) = alpha(2)*s(1) - beta(2)*s(2);
  sdot(3) = alpha(3)*s(2) - beta(3)*s(3);

endfunction


# The function produces a series of square pulses
# parameterized by the values shown below.
 
 
function ret = driver_single(t)
 
  global freq dur height width;
 
  ret = 0;
 
  if( t > dur+width )
    return;
  endif
 
  b = floor(t*freq)/freq;
  if( b <= t && t < b+width )
    ret = height;
  endif
 
  return;
 
endfunction


#
# Do benchmark
#
function time = test (f)
  global s0, t;
  oldtime = cputime;
  if (f == 1)     inv (hadamard (7));
  elseif (f == 2) inv (hadamard (8));
  elseif (f == 3) lsode ("sEPSP",s0,t);
  endif;
  time = cputime - oldtime;
endfunction;

nf = 3;
targetaccuracy = 0.025;
minrepetitions = 10;
maxrepetitions = 1000;

for f = 1:nf
  test (f);  # run a first test to increase the RSS, load things and so on
  res = [];
  for rep = 1:maxrepetitions
    res(rep) = test (f);
    if (rep < minrepetitions) continue endif;
    # purged results: remove min and max elements
    purged = (res != max(res)) & (res != min(res));
    pres = res(purged);
    if (std(pres)/mean(pres) < targetaccuracy) break endif;
  endfor;
  # print 95% confidence interval
  printf ("test #%d\t\t%8g +/- %.1f%% (%d runs)\n",
          f, mean(pres), 200*std(pres)/mean(pres), rep);
endfor;


--
Francesco Potorti`                  Voice:    +39-50-593203
Satellite Network Group             Operator: +39-50-593111
CNUCE-CNR, Via Santa Maria 36       Fax:      +39-50-904052(G3)/904051(G4)
56126 Pisa - Italy                  Email:    [hidden email]

Re: Slow performance on Linux

Harald Kirsch-2

>> I put together a bm.m file, modified from that posted by Evan Thomas.
>> [ stuff deleted ]

I ran the test on a linux box and also on a Solaris-box. On Solaris I
get lots of messages:
  warning: division by zero

How is it that the machines are so different? Do I not get division by
zero on Linux, or is it not reported?

In addition, test #3 does not come back. Is it correct that it takes
several minutes (or even more?)

Regards,
        Harald Kirsch
-------------------------------------------------+------------------
Harald Kirsch, [hidden email], +49 721 6091 384 | This  message is
FhG/IITB,      Fraunhoferstr.1, 76131 Karlsruhe  | subject oriented.

Slow performance on Linux

Francesco Potorti`-9
   How is it that the machines are so different? Do I not get division
   by zero on Linux, or is it not reported?

I get no zero division error on alpha.

   In addition, test #3 does not come back. Is it correct that it takes
   several minutes (or even more?)

It takes 8s per test on alpha.  If the time does not converge, it will
iterate as many as 1000 times, that is 8000s (a little over 2 hours) on
alpha.  Set the maxrepetitions variable in bm.m to 100 (it is
currently set to 1000), which is probably a more reasonable value.

Re: Slow performance on Linux

Jim Van Zandt
My results on a 90 MHz Pentium:

Octave, version 1.1.1.
Copyright (C) 1992, 1993, 1994, 1995 John W. Eaton.
This is free software with ABSOLUTELY NO WARRANTY.
For details, type `warranty'.

octave:1> bm
parse error near line 79 of file bm.m:

>>>

    ^

warning: comma in global declaration not interpreted as a command separator
test #1          0.23125 +/- 3.1% (10 runs)
test #2           1.2576 +/- 5.0% (102 runs)
test #3         0.062327 +/- 13.6% (1000 runs)

...about twice as long for the first two tests, and over 100 times as
long for the third test.

The parse error was generated by the formfeed.

In message <[hidden email]>, Francesco Potorti` writes:
>It takes 8s per test on alpha.  If the time does not converge, it will
>iterate as many as 1000 times, that is 8000s (a little over 2 hours) on
>alpha.  Set the maxrepetitions variable in bm.m to 100 (it is
>currently set to 1000), which is probably a more reasonable value.

I think it makes more sense to limit the total *time* (to perhaps 100
sec) rather than the number of repetitions.  That lets you use the
same benchmark over a wider range of machines, and continue to use it
as machines speed up.

We might also think about scaling up the size of the problem rather
than the number of iterations.  (As machines get more powerful, we
tend to tackle bigger problems rather than just answering the same
size problems in less time.  This is what makes Amdahl's Law mostly
irrelevant.)  Of course, the scaling has to be understood.  If the
benchmark were "the biggest matrix that can be inverted in 60 sec",
then doubling the size of the matrix takes 8 times the processing
speed.  The "doubling" doesn't sound as impressive as the factor of 8,
though.
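
The cubic scaling can be checked against the inv timings posted earlier
in this thread for the RS6000/590.  A small sketch, assuming a current
Octave; the variable names are illustrative, and the extrapolation at the
end assumes a purely cubic cost, which is a model and not a measurement:

```octave
# Matrix sizes and inv() wall-clock times in seconds, copied from the
# RS6000/590 session quoted earlier in the thread.
n     = [256, 512, 1024, 2048];
t_inv = [0.62577, 4.8428, 36.425, 285.83];

# Each doubling of n should cost about 2^3 = 8 times as long.
ratio    = t_inv(2:end) ./ t_inv(1:end-1);
exponent = log2 (ratio);          # observed exponent p in t ~ n^p
printf ("%5d -> %5d: time x %.2f (exponent %.2f)\n",
        [n(1:end-1); n(2:end); ratio; exponent]);

# Under the cubic assumption, a machine `speedup` times faster handles a
# matrix only nthroot (speedup, 3) times larger in the same time.
speedup = 8;
n_new = n(end) * nthroot (speedup, 3);
printf ("%dx faster machine: n = %d instead of %d\n", speedup, n_new, n(end));
```

The measured ratios come out between 7.5 and 7.9, so an exponent slightly
below 3 fits these particular timings.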

                          - Jim Van Zandt

Slow performance on Linux

Francesco Potorti`-9
Jim Van Zandt writes:

   My results on a 90 MHz Pentium:

   test #1          0.23125 +/- 3.1% (10 runs)
   test #2           1.2576 +/- 5.0% (102 runs)
   test #3         0.062327 +/- 13.6% (1000 runs)

   ...about twice as long for the first two tests, and over 100 times as
   long for the third test.

No, it's just the opposite!  It is about 100 times quicker for the
third test!  How can it be?

   The parse error was generated by the formfeed.

I sometimes add comments to source files just before posting them,
this time I added a form feed too, not remembering that octave does
not take it as a blank ...

   I think it makes more sense to limit the total *time* (to perhaps 100
   sec) rather than the number of repetitions.

I added a check in order to limit the length of each test to
maxseconds (default 60).

   We might also think about scaling up the size of the problem rather
   than the number of iterations.  

Yes, this is reasonable but more difficult, because you say:

                 Of course, the scaling has to be understood.  If the
   benchmark were "the biggest matrix that can be inverted in 60 sec",
   then doubling the size of the matrix takes 8 times the processing
   speed.

and it is not always obvious how the scaling works and how precise it
is (ordinarily one discards lowest order factors).  

However, here is another bm.m.  This time I use the example in the
Octave manual for the lsode function, which is very short.  It would
be interesting to know if we still obtain such different results on
different machines with this change.

On Alpha:

bm
test #1 0.0939749 +/- 4.8% (18 runs)
test #2 0.453003 +/- 3.6% (10 runs)
test #3 1.40105 +/- 1.6% (10 runs)

------------------- bm.m ---------------------
1;  # This is not a function file
 
function xdot = xdot (x, t)
  r = 0.25; k = 1.4; a = 1.5; b = 0.16; c = 0.9; d = 0.8;
  xdot(1) = r*x(1)*(1 - x(1)/k) - a*x(1)*x(2)/(1 + b*x(1));
  xdot(2) = c*a*x(1)*x(2)/(1 + b*x(1)) - d*x(2);
endfunction

#
# Do benchmark
#
nf = 3;

function time = test(f)
  global s0 t;  # no comma: a comma here triggers the warning reported above
  oldtime = cputime();
  if (f == 1)     inv(hadamard(7));
  elseif(f==2) inv(hadamard(8));
  elseif(f==3) lsode ("xdot",[1;2],(t=linspace(0,50,200)'));
  endif
  time = cputime() - oldtime;
endfunction

targetaccuracy = 0.025;
minrepetitions = 10;
maxrepetitions = 10000;
maxseconds = 60;

for f = 1:nf
  res = [];
  test(f);  # run a first test to increase the RSS, load things and so on
  tic();
  for rep = 1:maxrepetitions
    res(rep) = test(f);
    if (rep < minrepetitions)
      continue
    endif
    # purged results: remove min and max elements
    pres = res((res != max(res)) & (res != min(res)));
    if (std(pres)/mean(pres) < targetaccuracy || toc() > maxseconds)
      break
    endif
  endfor
  # print 95% confidence interval
  printf("test #%d\t\t%8g +/- %.1f%% (%d runs)\n", ...
          f, mean(pres), 200*std(pres)/mean(pres), rep);
endfor

Re: Slow performance on Linux

Robert.Wilhelm

> However, here is another bm.m.  This time I use the example in the
> octave manual for the lsode function, which is very short.  It would
> be interesting to know if we obtain yet such different results for
> different machines with this change.
>
> On Alpha:
>
> bm
> test #1 0.0939749 +/- 4.8% (18 runs)
> test #2 0.453003 +/- 3.6% (10 runs)
> test #3 1.40105 +/- 1.6% (10 runs)

Which Alpha?

  On Pentium 100 (async cache)

test #1         0.206364 +/- 4.9% (13 runs)
test #2           1.0725 +/- 3.1% (10 runs)
test #3            1.876 +/- 0.6% (10 runs)  
---
Robert Wilhelm  [hidden email]  [hidden email]

Slow performance on Linux

Francesco Potorti`-9
   Which Alpha?

Don't know.  I use it on the net.  I wanted to ask the admin, but I
cannot find him.  How do I know?


Slow performance on Linux

John W. Eaton-6
Francesco Potorti` <[hidden email]> wrote:

: Don't know.  I use it on the net.  I wanted to ask the admin, but I
: cannot find him.  How do I know?

[I'm not sure what happened to this message -- majordomo bounced it to
me for approval, then it was chopped when I approved it.  Hmm.  --jwe]

In any case, the question was, how to tell what kind of Alpha you
have.  If you are using OSF/1, probably the simplest thing to do is to
look in /var/adm/messages.  On my system, it contains the following
line:

  Jun 28 16:32:03 bevo vmunix: AlphaStation 250 4/266 system

plus some other interesting stuff about memory and what other devices
are available.

jwe




Benchmark for octave

Francesco Potorti`-9
   In any case, the question was, how to tell what kind of Alpha you
   have.  If you are using OSF/1, probably the simplest thing to do is to
   look in /var/adm/messages.

/var/adm/messages is readable by root only, and it is 250KB...  Anyway,
I spoke with the admin, and it turns out it is a 2100 with four
processors and shared memory.

On a system like this (which is almost always heavily loaded) using
tic and toc instead of cputime makes no sense.  Anyway, I could modify
the bm.m script in order to use clock instead of cputime if the latter
is not available.  What does cputime do on systems where it does not
work?
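
One way such a fallback might look is sketched below.  It assumes a
current Octave with function handles; time_call is a made-up helper name,
and the exist() check only covers the case where cputime is missing
altogether.  It does not answer what cputime returns on a system where
the function is present but does not work.

```octave
1;  # this is a script file, not a function file

# Hypothetical helper: time one call of the function handle `work`,
# using CPU time when cputime is available and wall-clock time otherwise.
function [secs, source] = time_call (work)
  if (exist ("cputime"))
    t0 = cputime ();
    work ();
    secs = cputime () - t0;
    source = "cputime";
  else
    t0 = clock ();
    work ();
    secs = etime (clock (), t0);
    source = "clock";
  endif
endfunction

# Example workload: invert a well-conditioned 200x200 matrix.
[secs, source] = time_call (@() inv (rand (200) + 200 * eye (200)));
printf ("%g s (measured with %s)\n", secs, source);
```

Returning the name of the timer used lets the benchmark print whether a
figure is CPU time or wall-clock time, since the two are not comparable
on a loaded machine.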
