On 01/03/2018 02:54 PM, Rik wrote:

> In addition to straightforward bugs, I'd like to see the performance not

> degrade too much between releases. I know that this is a trivial test, but

> the performance of double-nested for loops shows that performance has been

> declining over major releases, and that the development branch is 2.6X

> slower than 4.2.1.

>

> Sample Code:

>

> a = 1; b = 1; t0=tic; for i=1:1000; for j=1:1000; a = a + b + 123.0; end;

> end; t1=toc(t0); t1

>

> Results:

>

> 3.8.2 : 0.84617

> 4.0.3 : 1.4062

> 4.2.1 : 1.43

> 4.4.0-dev : 3.77

Are you doing an apples-to-apples comparison? E.g., all compiled on the

same system with the same configuration? It's not the case that one of

those is a release build and the release is compiled with all

optimizations or something like that? Same thing for JIT support?

All I can say is that GUI/cli environment doesn't seem to make a

difference and the time I'm seeing for 4.4.0-dev is twice what you are

reporting (3.16GHz Xeon).

octave:1> a = 1; b = 1; t0=tic; for i=1:1000; for j=1:1000; a = a + b + 123.0; end; end; t1=toc(t0); t1

t1 = 7.3447

You are testing a really basic routine. I wouldn't imagine that the

arithmetic translation has varied too much. Although perhaps the

assignment "a =" has gotten worse. Let's try:

octave:5> a = 1; b = 1; t0=tic; for i=1:1000; for j=1:1000; a + b + 123.0; end; end; t1=toc(t0); t1

t1 = 5.8170

A notable improvement, yet it doesn't look like the assignment is a

major drain. How about checking loop length (I'll continue without the

assignment to sort of remove a factor):

octave:6> a = 1; b = 1; t0=tic; for i=1:1000000; for j=1:1; a + b + 123.0; end; end; t1=toc(t0); t1

t1 = 7.4740

OK, the above suggests something interesting, which is that setting up

or initializing that inner loop could be the source of the change. So,

I'm going to guess that the other way around is pretty fast (but I

guessed wrong):

octave:7> a = 1; b = 1; t0=tic; for i=1:1; for j=1:1000000; a + b + 123.0; end; end; t1=toc(t0); t1

t1 = 5.8180

However, it's the same as the 1000 by 1000 performance. Strange.

Check this out:

octave:1> for lim_p = 0:6

> lim1 = 10^lim_p;

> lim2 = 10^(6-lim_p);

> a = 1; b = 1; t0=tic; for i=1:lim1; for j=1:lim2; a + b + 123.0; end; end; t1=toc(t0); t1

> end

t1 = 5.8178

t1 = 5.8183

t1 = 5.8155

t1 = 5.8219

t1 = 5.8895

t1 = 6.4637

t1 = 11.987

Guess I'd expect that to be a more linear relationship if setting up the

second loop is the major drain. As it stands, it suggests that the "a +

b + 123.0" portion of this is, in fact, a major consumer of time.
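
One way to separate loop overhead from the cost of evaluating the expression might be to time the same loop with an empty body and subtract. The following is only a rough sketch: the variable names (n, reps, t_empty, t_expr) and the iteration count are arbitrary choices, and taking the fastest of a few runs is just a simple way to reduce noise.

```octave
% Rough sketch: estimate the per-iteration cost of the expression by
% subtracting the time of an empty loop from the time of the full loop.
n = 1e5;          % iteration count (assumed; raise for steadier numbers)
reps = 3;         % repeat and keep the fastest run to reduce noise
a = 1; b = 1;

% Baseline: loop with an empty body.
t_empty = Inf;
for r = 1:reps
  t0 = tic;
  for j = 1:n
  end
  t_empty = min (t_empty, toc (t0));
end

% Same loop, now evaluating the expression on each iteration.
t_expr = Inf;
for r = 1:reps
  t0 = tic;
  for j = 1:n
    a + b + 123.0;
  end
  t_expr = min (t_expr, toc (t0));
end

printf ("empty loop:       %g s\n", t_empty);
printf ("with expression:  %g s\n", t_expr);
printf ("approx. cost per expression: %g s\n", (t_expr - t_empty) / n);
```

If the empty loop accounts for only a small share of the total, that would be consistent with the expression evaluation, rather than the loop setup, dominating the time.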

For comparison

octave:10> a = ones(1000000,1);

octave:11> b = ones(1000000,1);

octave:12> t0=tic; a + b + 123.0; t1=toc(t0); t1

t1 = 0.017333

Quite a difference.
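
To make the loop and vectorized timings directly comparable, both can compute the same elementwise result and be checked against each other. A minimal sketch, assuming a smaller array size than above so the loop finishes quickly (c_vec, c_loop, and n are names picked for illustration):

```octave
% Sketch: same elementwise computation done vectorized and with a loop.
n = 1e5;                      % assumed size; the session above used 1e6
a = ones (n, 1);
b = ones (n, 1);

% Vectorized: one interpreter-level operation over the whole array.
t0 = tic;
c_vec = a + b + 123.0;
t_vec = toc (t0);

% Loop: one interpreted expression per element.
c_loop = zeros (n, 1);
t0 = tic;
for k = 1:n
  c_loop(k) = a(k) + b(k) + 123.0;
end
t_loop = toc (t0);

% Both approaches should produce identical results.
assert (isequal (c_vec, c_loop));
printf ("vectorized: %g s, loop: %g s\n", t_vec, t_loop);
```

Note that this loop also pays for indexing and assignment on each iteration, so it measures somewhat more than the bare "a + b + 123.0" timed earlier.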

Rik, I'm wondering how much the C++ compiler factors into this. The

only fair comparison is versions of the code built with the same

compiler with the exact same settings.
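
As a starting point for checking that, each build could report what it is before the benchmark is run. This is only a sketch: it assumes mkoctfile is installed alongside each Octave (some distributions ship it in a separate development package), and the CXXFLAGS it prints are the defaults recorded for building oct-files, which may or may not match every flag used to compile Octave itself.

```octave
% Sketch: record basic build information next to each timing result.
printf ("version:  %s\n", version ());
printf ("platform: %s\n", computer ());

% Print the C++ flags recorded at configure time.
% Assumption: mkoctfile is available for this installation.
mkoctfile ("-p", "CXXFLAGS");
```

Running this in each of the versions being compared would at least show whether optimization flags differ between them.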

Dan