Re: bugs to fix before 4.4.0 release

Rik-4
In addition to straightforward bugs, I'd like to see the performance not
degrade too much between releases.  I know that this is a trivial test, but
the performance of double-nested for loops shows that performance has been
declining over major releases, and that the development branch is 2.6X
slower than 4.2.1.

Sample Code:

a = 1; b = 1; t0=tic; for i=1:1000; for j=1:1000; a = a + b + 123.0; end; end; t1=toc(t0); t1

Results:

3.8.2 : 0.84617
4.0.3 : 1.4062
4.2.1 : 1.43
4.4.0-dev : 3.77
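
A single timing like the ones above can be noisy. As a rough sketch (not part of the original measurements), the same benchmark can be repeated a few times and the best run kept. The repetition count `nrep` and the variable `times` are assumptions introduced here for illustration.

```octave
% Sketch: repeat the nested-loop benchmark and keep the best time.
nrep = 3;                     % assumed repetition count, not from the thread
times = zeros (1, nrep);
for r = 1:nrep
  a = 1; b = 1;
  t0 = tic;
  for i = 1:1000
    for j = 1:1000
      a = a + b + 123.0;      % the statement under test
    end
  end
  times(r) = toc (t0);        % elapsed seconds for this repetition
end
printf ("best of %d runs: %g s\n", nrep, min (times));
```

Taking the minimum rather than the mean is one common choice, since background load can only make a run slower.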

--Rik



Re: bugs to fix before 4.4.0 release

Daniel Sebald
On 01/03/2018 02:54 PM, Rik wrote:

> In addition to straightforward bugs, I'd like to see the performance not
> degrade too much between releases.  I know that this is a trivial test, but
> the performance of double-nested for loops shows that performance has been
> declining over major releases, and that the development branch is 2.6X
> slower than 4.2.1.
>
> Sample Code:
>
> a = 1; b = 1; t0=tic; for i=1:1000; for j=1:1000; a = a + b + 123.0; end;
> end; t1=toc(t0); t1
>
> Results:
>
> 3.8.2 : 0.84617
> 4.0.3 : 1.4062
> 4.2.1 : 1.43
> 4.4.0-dev : 3.77

Are you doing an apples-to-apples comparison?  E.g., all compiled on the
same system with the same configuration?  It's not the case that one of
those is a release build and the release is compiled with all
optimizations or something like that?  Same thing for JIT support?

All I can say is that GUI/cli environment doesn't seem to make a
difference and the time I'm seeing for 4.4.0-dev is twice what you are
reporting (3.16GHz Xeon).

octave:1> a = 1; b = 1; t0=tic; for i=1:1000; for j=1:1000; a = a + b + 123.0; end; end; t1=toc(t0); t1
t1 =  7.3447

You are testing a really basic routine.  I wouldn't imagine that the
arithmetic translation has varied too much.  Although perhaps the
assignment "a =" has gotten worse.  Let's try:

octave:5> a = 1; b = 1; t0=tic; for i=1:1000; for j=1:1000; a + b + 123.0; end; end; t1=toc(t0); t1
t1 =  5.8170

A notable improvement, yet it doesn't look like the assignment is a
major drain.  How about checking loop length (I'll continue without the
assignment to sort of remove a factor):

octave:6> a = 1; b = 1; t0=tic; for i=1:1000000; for j=1:1; a + b + 123.0; end; end; t1=toc(t0); t1
t1 =  7.4740

OK, the above suggests something interesting, which is that setting up
or initializing that inner loop could be the source of the change.  So,
I'm going to guess that the other way around is pretty fast (but I
guessed wrong):

octave:7> a = 1; b = 1; t0=tic; for i=1:1; for j=1:1000000; a + b + 123.0; end; end; t1=toc(t0); t1
t1 =  5.8180

However, it's the same as the 1000 by 1000 performance.  Strange.

Check this out:

octave:1> for lim_p = 0:6
 >   lim1 = 10^lim_p;
 >   lim2 = 10^(6-lim_p);
 >   a = 1; b = 1; t0=tic; for i=1:lim1; for j=1:lim2; a + b + 123.0; end; end; t1=toc(t0); t1
 > end
t1 =  5.8178
t1 =  5.8183
t1 =  5.8155
t1 =  5.8219
t1 =  5.8895
t1 =  6.4637
t1 =  11.987

Guess I'd expect that to be a more linear relationship if setting up the
second loop is the major drain.  As it stands, it suggests that the "a +
b + 123.0" portion of this is, in fact, a major consumer of time.
For comparison:

octave:10> a = ones(1000000,1);
octave:11> b = ones(1000000,1);
octave:12> t0=tic; a + b + 123.0; t1=toc(t0); t1
t1 =  0.017333

Quite a difference.
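
As a minimal sketch of that comparison in script form, the following times a scalar loop against the equivalent vectorized expression. The size `n` and the result variables `c`, `cv`, `t_loop`, and `t_vec` are assumptions made for illustration. Unlike the loop tests above, the result is assigned so that it can be checked.

```octave
% Sketch: scalar loop versus one vectorized evaluation of a + b + 123.0.
n = 1e5;                      % assumed size, smaller than the 1e6 used above

a = 1; b = 1;
t0 = tic;
for k = 1:n
  c = a + b + 123.0;          % scalar evaluation, once per iteration
end
t_loop = toc (t0);

av = ones (n, 1); bv = ones (n, 1);
t0 = tic;
cv = av + bv + 123.0;         % a single evaluation over n elements
t_vec = toc (t0);

printf ("loop: %g s, vectorized: %g s\n", t_loop, t_vec);
```

The loop pays the interpreter's per-statement cost n times, while the vectorized form pays it once, which is consistent with the gap measured above.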

Rik, I'm wondering how much the C++ compiler factors into this.  The
only fair comparison is versions of the code built with the same
compiler with the exact same settings.

Dan


Re: bugs to fix before 4.4.0 release

Rik-4
On 01/03/2018 02:08 PM, Daniel J Sebald wrote:

> On 01/03/2018 02:54 PM, Rik wrote:
>> In addition to straightforward bugs, I'd like to see the performance not
>> degrade too much between releases.  I know that this is a trivial test, but
>> the performance of double-nested for loops shows that performance has been
>> declining over major releases, and that the development branch is 2.6X
>> slower than 4.2.1.
>>
>> Sample Code:
>>
>> a = 1; b = 1; t0=tic; for i=1:1000; for j=1:1000; a = a + b + 123.0; end;
>> end; t1=toc(t0); t1
>>
>> Results:
>>
>> 3.8.2 : 0.84617
>> 4.0.3 : 1.4062
>> 4.2.1 : 1.43
>> 4.4.0-dev : 3.77
>
> Are you doing an apples-to-apples comparison?  E.g., all compiled on the
> same system with the same configuration?  It's not the case that one of
> those is a release build and the release is compiled with all
> optimizations or something like that?  Same thing for JIT support?

Yes, all versions were self-compiled on the same machine and use the same
compile options.  No JIT.  I believe they were all the same gcc major and
minor versions as well. 

It's a bit of a mystery.  I tried compiling octave with profiling and using
oprofile but I never really got results that I could understand and that
would lead to taking some concrete action.

--Rik


Re: bugs to fix before 4.4.0 release

Michael Godfrey
Just for information, I get a very similar ratio for dev/4.2.1. This
is a bigger jump than in the past. My system is an Intel NUC7i5BNH,
i.e. I7 CPU.

I tried a few modifications, but the ratio stayed about the same.
Also, CPU usage (on 1 processor) was at just about 100% for the whole time in all
cases.

Is it not possible to get a trace on just the a = a + b + 123.0 sequence?

Michael

On 01/03/2018 11:37 PM, Rik wrote:
> 3.8.2 : 0.84617
> 4.0.3 : 1.4062
> 4.2.1 : 1.43
> 4.4.0-dev : 3.77


Re: bugs to fix before 4.4.0 release

ederag
On Thursday, January 04, 2018 00:09:31 Michael D Godfrey wrote:
> Just for information, I get a very similar ratio for dev/4.2.1. This
> is a bigger jump than in the past. My system is an Intel NUC7i5BNH,
> i.e. I7 CPU.

> On 01/03/2018 11:37 PM, Rik wrote:
> > 3.8.2 : 0.84617
> > 4.0.3 : 1.4062
> > 4.2.1 : 1.43
> > 4.4.0-dev : 3.77

same here:
4.2.1: 1.46
dev (24498:1c96b44feb7a): 3.27

The build procedure is the one described at
http://wiki.octave.org/Octave_and_separate_toolchain
except now with gcc-6.4.0.
CPU: i7-5960X oc@4.0GHz

Maybe a hint: a += b + 123.0 is much faster:
a = 1; b = 1; t0=tic; for i=1:1000; for j=1:1000; a += b + 123.0; end; end; t1=toc(t0); t1
4.2.1: 0.98
dev (24498:1c96b44feb7a): 2.35

and
On Wednesday, January 03, 2018 16:08:27 Daniel J Sebald wrote:
> octave:1> for lim_p = 0:6
>  >   lim1 = 10^lim_p;
>  >   lim2 = 10^(6-lim_p);
>  >   a = 1; b = 1; t0=tic; for i=1:lim1; for j=1:lim2; a + b + 123.0; end; end; t1=toc(t0); t1
>  > end
same here:
4.2.1:
t1 =  1.55
t1 =  1.44
t1 =  1.45
t1 =  1.45
t1 =  1.47
t1 =  1.66
t1 =  3.52

dev (24498:1c96b44feb7a):
t1 =  2.72
t1 =  2.61
t1 =  2.61
t1 =  2.62
t1 =  2.64
t1 =  2.87
t1 =  5.13

Ederag


Re: bugs to fix before 4.4.0 release

Andreas Weber-6
In reply to this post by Rik-4
Rik,

On 04.01.2018 at 00:37, Rik wrote:
>>> Results:
>>>
>>> 3.8.2 : 0.84617
>>> 4.0.3 : 1.4062

Are you able to build 3.8.2 and 4.0.3 with gcc 6 or are you still
running gcc 5?

I ask because I'm currently trying to build and benchmark the last 5
releases in docker, but I'm struggling to build 3.8.2 with gcc 6.3.0.

I've found two postings on the mailing list:

Subject: Major issues with g++ 6 and gnulib
Date: Fri, 19 Feb 2016 16:33:04 -0700

Subject: Octave 4.0.2 and 4.0.3 do not build on fc24 (gcc-6.1.1)
Date: Thu, 7 Jul 2016 00:13:28 +1000

and the suggestion was to use gcc 5 for Octave <= 4.0

-- Andy


Re: bugs to fix before 4.4.0 release

John W. Eaton
In reply to this post by Michael Godfrey
On 01/03/2018 07:09 PM, Michael D Godfrey wrote:

> Just for information, I get a very similar ratio for dev/4.2.1. This
> is a bigger jump than in the past. My system is an Intel NUC7i5BNH,
> i.e. I7 CPU.
>
> I tried a few modifications, but the ratio stayed about the same.
> Also, CPU usage (on 1 processor) was at just about 100% for the whole
> time in all
> cases.
>
> Is it not possible to get a trace on just  the a = a + b + 123.0 sequence?

I haven't done any real profiling, but from my tests it looks like the
evaluation of the expression is the real problem.  Comparing

   t0=tic; for i=1:1000; for j=1:1000; end; end; t1=toc(t0); t1

in 4.2.1 and the current dev version I see nearly identical results.
Both were built with GCC 7.2.0 on a Debian system.

jwe


Re: bugs to fix before 4.4.0 release

Carlo de Falco-2


> On 4 Jan 2018, at 16:07, John W. Eaton <[hidden email]> wrote:
>
> On 01/03/2018 07:09 PM, Michael D Godfrey wrote:
>> Just for information, I get a very similar ratio for dev/4.2.1. This
>> is a bigger jump than in the past. My system is an Intel NUC7i5BNH,
>> i.e. I7 CPU.
>> I tried a few modifications, but the ratio stayed about the same.
>> Also, CPU usage (on 1 processor) was at just about 100% for the whole time in all
>> cases.
>> Is it not possible to get a trace on just  the a = a + b + 123.0 sequence?
>
> I haven't done any real profiling, but from my tests it looks like the evaluation of the expression is the real problem.  Comparing
>
>  t0=tic; for i=1:1000; for j=1:1000; end; end; t1=toc(t0); t1
>
> in 4.2.1 and the current dev version I see nearly identical results. Both were built with GCC 7.2.0 on a Debian system.
>
> jwe
>

Indeed, the nested loop does not seem to be related to the slowdown;
the following test with a single loop shows the same problem.

 a = b = 1; ii = 1; tic; while (ii++ < 1e6), a = a + b + 123; end; toc

Here are results on macOS High Sierra for Octave self-compiled with clang++ (Apple LLVM version 9.0.0 (clang-900.0.38)):

  4.2.1
  Elapsed time is 2.59538 seconds.

  4.3.0+
  Elapsed time is 7.86283 seconds.

That is roughly a 3x slowdown.

Using "+=" gives a small but consistently reproducible improvement in both versions,
but the ratio between the two is even larger.

  4.2.1
  Elapsed time is 1.86918 seconds.

  4.3.0+
  Elapsed time is 6.45173 seconds.
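
For reference, the ratios implied by the four timings in this message can be worked out directly. This is only arithmetic on the reported numbers; the variable names `t_421`, `t_dev`, and `ratio` are made up for illustration.

```octave
% Timings reported in this message (seconds), clang++ builds on macOS.
t_421 = [2.59538, 1.86918];   % 4.2.1:  "a = a + b + 123", then "a += b + 123"
t_dev = [7.86283, 6.45173];   % 4.3.0+: same two variants
ratio = t_dev ./ t_421;       % slowdown factor of 4.3.0+ relative to 4.2.1
printf ("a = a + ...: %.2fx, a += ...: %.2fx\n", ratio(1), ratio(2));
```

This gives about 3.03 for the plain assignment and about 3.45 for the "+=" form.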

c.

Re: bugs to fix before 4.4.0 release

Andreas Weber-6
In reply to this post by Rik-4
On 03.01.2018 at 21:54, Rik wrote:

> In addition to straightforward bugs, I'd like to see the performance not
> degrade too much between releases.  I know that this is a trivial test, but
> the performance of double-nested for loops shows that performance has been
> declining over major releases, and that the development branch is 2.6X
> slower than 4.2.1.
>
> Sample Code:
>
> a = 1; b = 1; t0=tic; for i=1:1000; for j=1:1000; a = a + b + 123.0; end;
> end; t1=toc(t0); t1
>
> Results:
>
> 3.8.2 : 0.84617
> 4.0.3 : 1.4062
> 4.2.1 : 1.43
> 4.4.0-dev : 3.77

My docker containers with debian jessie and gcc 4.9.2:

octave_3.6.4 : t1 =  0.93715
octave_3.8.2 : t1 =  1.0525
octave_4.0.3 : t1 =  1.3874
octave_4.2.1 : t1 =  1.4061





Re: loop performance

Rik-4
In reply to this post by John W. Eaton
On 01/04/2018 07:07 AM, John W. Eaton wrote:

> On 01/03/2018 07:09 PM, Michael D Godfrey wrote:
>> Just for information, I get a very similar ratio for dev/4.2.1. This
>> is a bigger jump than in the past. My system is an Intel NUC7i5BNH,
>> i.e. I7 CPU.
>>
>> I tried a few modifications, but the ratio stayed about the same.
>> Also, CPU usage (on 1 processor) was at just about 100% for the whole
>> time in all
>> cases.
>>
>> Is it not possible to get a trace on just  the a = a + b + 123.0 sequence?
>
> I haven't done any real profiling, but from my tests it looks like the
> evaluation of the expression is the real problem.  Comparing
>
>   t0=tic; for i=1:1000; for j=1:1000; end; end; t1=toc(t0); t1
>
> in 4.2.1 and the current dev version I see nearly identical results. Both
> were built with GCC 7.2.0 on a Debian system.
>
> jwe

I filed a bug report (https://savannah.gnu.org/bugs/index.php?52809) to
keep track of this.  Quoting from the write-up there:


The evaluation of statements seems to be much slower on the development
branch than in previous versions.

A sample script testing just a for loop is:

tic; for i=1:1000; for j=1:1000; end; end; toc

On my machine, results are roughly equivalent between 4.2.1 and the
development branch.

4.2.1 : 0.109 seconds
dev.  : 0.121 seconds

But if there is even a single variable assignment the results are far worse.

tic; for i=1:1000; for j=1:1000; a = 1.0; end; end; toc

Results:
4.2.1 : 0.218 seconds
dev.  : 1.28 seconds
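
As a rough estimate from these four numbers, subtracting the empty-loop time isolates the cost attributable to the assignment itself. The names `t_empty`, `t_assign`, and `per_assign` are assumptions used only in this sketch.

```octave
% Sketch: per-assignment overhead derived from the timings quoted above.
n = 1000 * 1000;                        % iterations of the nested loop
t_empty  = [0.109, 0.121];              % empty loop:      4.2.1, dev (seconds)
t_assign = [0.218, 1.28];               % with "a = 1.0;": 4.2.1, dev (seconds)
per_assign = (t_assign - t_empty) / n;  % seconds attributable to one assignment
printf ("4.2.1: %.3f us, dev: %.3f us, ratio: %.1f\n", ...
        per_assign(1) * 1e6, per_assign(2) * 1e6, ...
        per_assign(2) / per_assign(1));
```

By this estimate one assignment costs about 0.109 microseconds in 4.2.1 and about 1.159 microseconds on the development branch, a factor of roughly 10.6.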

--Rik


Re: bugs to fix before 4.4.0 release

ederag
In reply to this post by Andreas Weber-6
On Thursday, January 04, 2018 16:00:02 Andreas Weber wrote:
> and the suggestion was to use gcc 5 for Octave <= 4.0

Mike Miller made a patch for debian.
After extraction of the patches, it was possible to
build octave 4.0.3 with gcc 6.4.0:
https://savannah.gnu.org/bugs/?32444#comment7

Ederag



Re: bugs to fix before 4.4.0 release

Andreas Weber-6
On 04.01.2018 at 20:50, ederag wrote:
> On Thursday, January 04, 2018 16:00:02 Andreas Weber wrote:
>> and the suggestion was to use gcc 5 for Octave <= 4.0
>
> Mike Miller made a patch for debian.
> After extraction of the patches, it was possible to
> build octave 4.0.3 with gcc 6.4.0:
> https://savannah.gnu.org/bugs/?32444#comment7

Oh, this is great. Thank you very much
-- Andy


building older Octave with GCC 6 (was: bugs to fix before 4.4.0 release)

Mike Miller-4
On Thu, Jan 04, 2018 at 21:21:59 +0100, Andreas Weber wrote:
> Oh, this is great. Thank you very much

If anyone is interested, I have patches to build Octave 3.4 through 4.0
with GCC 6. Probably GCC 7 as well, but untested.

--
mike


Re: building older Octave with GCC 6

Andreas Weber-6
On 04.01.2018 at 23:08, Mike Miller wrote:
> On Thu, Jan 04, 2018 at 21:21:59 +0100, Andreas Weber wrote:
>> Oh, this is great. Thank you very much
>
> If anyone is interested, I have patches to build Octave 3.4 through 4.0
> with GCC 6. Probably GCC 7 as well, but untested.

Yes, I would be interested because I plan to build docker images
based on stretch-lite (gcc 6.3) for releases 3.4 until default.

-- Andy



Re: building older Octave with GCC 6

Mike Miller-4
On Fri, Jan 05, 2018 at 09:01:26 +0100, Andreas Weber wrote:
> Yes, I would be interested because I plan to build docker images
> based on stretch-lite (gcc 6.3) for releases 3.4 until default.

https://bitbucket.org/snippets/mtmiller/XeEyAj

--
mike
