For loop benchmarks between Octave versions

classic Classic list List threaded Threaded
45 messages Options
123
Reply | Threaded
Open this post in threaded view
|

For loop benchmarks between Octave versions

Rik-4
6/6/14

All,

I followed up one of the recent posts about differing performance between
Octave versions.  Alas, I think it is quite true.  We don't yet have 100%
test coverage for behavior, but it seems like another wish list item would
be a set of benchmarking tests so that we don't accidentally introduce
large performance regressions.

The test code in question is a double for loop.  We expect for loops to be
inefficient and slow, but at least they shouldn't get any worse than they
already are.

--- Test Code ---
a = 1; b = 1; tic; for i=1:1000; for j=1:1000; a = a + b; end; end; toc
--- End Test Code ---

I fixed the CPU frequency before running the tests to avoid differences due
to frequency scaling.  I experimented a bit with taskset to ensure the
Octave process stayed on a single core, but it didn't make a difference so
I stopped using it (I would expect this since the Linux kernel also tries
for performance reasons not to shift running jobs off of a core).  All of
the versions are locally compiled on a Linux machine using roughly the same
options of '-O2'.

Results:

Version    3.2.4    3.4.3    3.6.4    3.8.0    3.8.1    gui-release
(no-gui-libs)    dev (no-gui-libs)    dev (gui)

     2.24418    1.71425    2.95948    3.4683    3.15712    6.50268  
 7.00682    8.09868
    2.24436    1.71648    2.97603    3.43708    3.15531    6.47645  
 6.93558    8.0038
    2.25401    1.7231    2.96663    3.46252    3.19536    6.46978  
 6.94941    8.07056
                               
Mean    2.24752    1.71794    2.96738    3.45597    3.16926    6.48297  
 6.96394    8.05768

There was a doubling in execution time from 3.4.3 to 3.6.4, mostly
unchanged from 3.6.4 to 3.8.1, and then another doubling on the gui-release
branch.  The gui-release branch and other releases going forward is
compiled with atomic refcounting which is necessary for the GUI, but does
have a performance penalty.  I'm re-compiling the gui-release branch with
atomic refcounting disabled to see how much it improves things, but I don't
have high hopes.  The last time I benchmarked this it was only a 10%
difference, rather than the 100% difference that the tests show.

--Rik


Reply | Threaded
Open this post in threaded view
|

Re: For loop benchmarks between Octave versions

PrasannaKumar Muralidharan
The same test case took 2.35s in 3.6.4 while it took around 3.9s in
3.7.5. So bisected Octave code between release 3.6.4 and 3.8.1.

CPU details:
Intel Core i5. Clock speed set to 1.2GHz.

The time taken to run the test case:
********************************************************************************
3.7.5 (did not note hg revision) -> 3.9s
16650:18d3c1b981e7 (3.7.3+) -> 4.13013s
16412:61989cde13ae (3.7.2+) -> 3.5439s
16295:4a1300ed5d3c (3.7.2+) -> 3.96383s
16216:70c47da7e02b (3.7.2+) -> 3.99449s
16183:359d56094efa (3.7.2+) -> 4.15015s
16168:8650eec57e9f (3.7.2+) -> 3.86547s
16161:b672afbb7c3c (3.7.2+) -> 4.06835s
16157:335041cc657a (3.7.2+) -> 4.06872s
16089:8a8e04aa3c98 (3.6.4) -> 2.34667s
16156:236be6179785 (3.7.2+) -> 3.85227s
********************************************************************************

Hg bisect result:
********************************************************************************
The first bad revision is:
changeset:   16156:236be6179785
parent:      16154:aa5e1e8dce66
parent:      16089:8a8e04aa3c98
user:        John W. Eaton <[hidden email]>
date:        Thu Feb 28 02:25:44 2013 -0500
summary:     maint: periodic merge of stable to default

Not all ancestors of this changeset have been checked.
Use bisect --extend to continue the bisection from
the common ancestor, 6a44bb3c9593.
********************************************************************************

This result can be used to do further analysis.

Note:
I am continuing with the bisect further to find the change that causes
the regression.

Hope this helps,
PrasannaKumar

Reply | Threaded
Open this post in threaded view
|

Re: For loop benchmarks between Octave versions

Rik-4
On 06/15/2014 02:46 AM, PrasannaKumar Muralidharan wrote:

> The same test case took 2.35s in 3.6.4 while it took around 3.9s in
> 3.7.5. So bisected Octave code between release 3.6.4 and 3.8.1.
>
> CPU details:
> Intel Core i5. Clock speed set to 1.2GHz.
>
> The time taken to run the test case:
> ********************************************************************************
> 3.7.5 (did not note hg revision) -> 3.9s
> 16650:18d3c1b981e7 (3.7.3+) -> 4.13013s
> 16412:61989cde13ae (3.7.2+) -> 3.5439s
> 16295:4a1300ed5d3c (3.7.2+) -> 3.96383s
> 16216:70c47da7e02b (3.7.2+) -> 3.99449s
> 16183:359d56094efa (3.7.2+) -> 4.15015s
> 16168:8650eec57e9f (3.7.2+) -> 3.86547s
> 16161:b672afbb7c3c (3.7.2+) -> 4.06835s
> 16157:335041cc657a (3.7.2+) -> 4.06872s
> 16089:8a8e04aa3c98 (3.6.4) -> 2.34667s
> 16156:236be6179785 (3.7.2+) -> 3.85227s
> ********************************************************************************
>
> Hg bisect result:
> ********************************************************************************
> The first bad revision is:
> changeset:   16156:236be6179785
> parent:      16154:aa5e1e8dce66
> parent:      16089:8a8e04aa3c98
> user:        John W. Eaton <[hidden email]>
> date:        Thu Feb 28 02:25:44 2013 -0500
> summary:     maint: periodic merge of stable to default
>
> Not all ancestors of this changeset have been checked.
> Use bisect --extend to continue the bisection from
> the common ancestor, 6a44bb3c9593.
> ********************************************************************************
>
> This result can be used to do further analysis.
>
> Note:
> I am continuing with the bisect further to find the change that causes
> the regression.
>
> Hope this helps,
> PrasannaKumar
Thanks for working on this issue.  Since this was just a merge of code it
is most likely that the changeset which caused the problem is still
backwards in time on either the stable or development branch.  Maybe follow
Mercurial's guidance and continue with 'hg bisect --extend' to see if it
can locate it.

I see that the parent revision of the merge was 16089 which was just a
changeset to tag the 3.6.4 release which was itself 16088.  Since we know
that 3.6.4 had an acceptable running time, the problem changeset is likely
to be on the development branch.

--Rik


Reply | Threaded
Open this post in threaded view
|

Re: For loop benchmarks between Octave versions

PrasannaKumar Muralidharan
> Thanks for working on this issue.  Since this was just a merge of code it
> is most likely that the changeset which caused the problem is still
> backwards in time on either the stable or development branch.  Maybe follow
> Mercurial's guidance and continue with 'hg bisect --extend' to see if it
> can locate it.

Trying this but having some issues (explained below). Any help will be
appreciated.

> I see that the parent revision of the merge was 16089 which was just a
> changeset to tag the 3.6.4 release which was itself 16088.  Since we know
> that 3.6.4 had an acceptable running time, the problem changeset is likely
> to be on the development branch.

I tried "hg bisect --extend" as hg suggested. The revision 16065 is
the next version to be checked. But I am not able to compile it, error
is "cannot find oct-parse.h". Don't know how to skip this revision.
"hg bisect --extend --skip" does not work. It will be great if someone
can help me in skipping this revision.

Thanks and Regards,
PrasannaKumar

Reply | Threaded
Open this post in threaded view
|

Re: For loop benchmarks between Octave versions

PrasannaKumar Muralidharan
In reply to this post by Rik-4
Is there a way to reduce the compile time? Every make takes around 30
minutes in my machine. Every make is compiling the full code from
scratch.

Possible reasons may be 1) there are lot of changes 2) running
configure scripts creates a newmake file everytime which causes the
make to recompile full code.

Thanks and Regards,
PrasannaKumar

Reply | Threaded
Open this post in threaded view
|

Re: For loop benchmarks between Octave versions

bpabbott
Administrator
On Jun 15, 2014, at 1:23 PM, PrasannaKumar Muralidharan <[hidden email]> wrote:

> Is there a way to reduce the compile time? Every make takes around 30
> minutes in my machine. Every make is compiling the full code from
> scratch.
>
> Possible reasons may be 1) there are lot of changes 2) running
> configure scripts creates a newmake file everytime which causes the
> make to recompile full code.
>
> Thanks and Regards,
> PrasannaKumar

You don't need to run "configure" each time you want to compile.  Just run "make".

If configure is running automatically, what are your configure options?  Specifically did you include "--disable-dependency-tracking" as an option to configure?

Ben


Reply | Threaded
Open this post in threaded view
|

Re: For loop benchmarks between Octave versions

Rik-4
On 06/15/2014 10:29 AM, Ben Abbott wrote:

> On Jun 15, 2014, at 1:23 PM, PrasannaKumar Muralidharan <[hidden email]> wrote:
>
>> Is there a way to reduce the compile time? Every make takes around 30
>> minutes in my machine. Every make is compiling the full code from
>> scratch.
>>
>> Possible reasons may be 1) there are lot of changes 2) running
>> configure scripts creates a newmake file everytime which causes the
>> make to recompile full code.
>>
>> Thanks and Regards,
>> PrasannaKumar
> You don't need to run "configure" each time you want to compile.  Just run "make".
>
> If configure is running automatically, what are your configure options?  Specifically did you include "--disable-dependency-tracking" as an option to configure?
>
> Ben
>
>
Also, I use ccache which helps a lot.  For a version I have built most of
before, such as the development tip, the compile time is usually reduced to
about 3 minutes.  If I have to compile everything from scratch then it is,
indeed, about 30 minutes.  Unfortunately bisecting means going back to lots
of versions of code which aren't in the cache so you get less benefit.

--Rik


Reply | Threaded
Open this post in threaded view
|

Re: For loop benchmarks between Octave versions

Rik-4
In reply to this post by PrasannaKumar Muralidharan
On 06/15/2014 02:46 AM, PrasannaKumar Muralidharan wrote:

> The same test case took 2.35s in 3.6.4 while it took around 3.9s in
> 3.7.5. So bisected Octave code between release 3.6.4 and 3.8.1.
>
> CPU details:
> Intel Core i5. Clock speed set to 1.2GHz.
>
> The time taken to run the test case:
> ********************************************************************************
> 3.7.5 (did not note hg revision) -> 3.9s
> 16650:18d3c1b981e7 (3.7.3+) -> 4.13013s
> 16412:61989cde13ae (3.7.2+) -> 3.5439s
> 16295:4a1300ed5d3c (3.7.2+) -> 3.96383s
> 16216:70c47da7e02b (3.7.2+) -> 3.99449s
> 16183:359d56094efa (3.7.2+) -> 4.15015s
> 16168:8650eec57e9f (3.7.2+) -> 3.86547s
> 16161:b672afbb7c3c (3.7.2+) -> 4.06835s
> 16157:335041cc657a (3.7.2+) -> 4.06872s
> 16089:8a8e04aa3c98 (3.6.4) -> 2.34667s
> 16156:236be6179785 (3.7.2+) -> 3.85227s
> ********************************************************************************
>
> Hg bisect result:
> ********************************************************************************
> The first bad revision is:
> changeset:   16156:236be6179785
> parent:      16154:aa5e1e8dce66
> parent:      16089:8a8e04aa3c98
> user:        John W. Eaton <[hidden email]>
> date:        Thu Feb 28 02:25:44 2013 -0500
> summary:     maint: periodic merge of stable to default
>
> Not all ancestors of this changeset have been checked.
> Use bisect --extend to continue the bisection from
> the common ancestor, 6a44bb3c9593.
> ********************************************************************************
>
> This result can be used to do further analysis.
>
> Note:
> I am continuing with the bisect further to find the change that causes
> the regression.
>
> Hope this helps,
> PrasannaKumar
>
I find that this changeset:

12920:5d18231eee00 Extend profiling support to operators.

produces a 50% slow down when using the for loop benchmark.  Unless we can
figure out a lower footprint way to do profiling we might be stuck with
this part of the slowdown.  There is still another 50% which seems to be
due to other accumulated cruft, i.e., I didn't find a smoking gun changeset.

--Rik

Reply | Threaded
Open this post in threaded view
|

Re: For loop benchmarks between Octave versions

ijourneaux
I took the time to try out the various Mingw versions I have. The for loop test code yielded (averaged over 5 runs). Win7 Core i5-3470 at 3.20GHz

3.4.3         0.796s
3.6.4         1.030s
3.8.2rc1     1.112s   (gui)
3.8.2rc1     1.024s   (cli)

Just another data point.
Ian
Reply | Threaded
Open this post in threaded view
|

Re: For loop benchmarks between Octave versions

Julien Bect
In reply to this post by Rik-4
Le 17/06/2014 02:39, Rik a écrit :
> I find that this changeset:
>
> 12920:5d18231eee00 Extend profiling support to operators.
>
> produces a 50% slow down when using the for loop benchmark. Unless we
> can figure out a lower footprint way to do profiling we might be stuck
> with this part of the slowdown.

The attached patch significantly improves the situation for me.

On the gui-release branch, the runtime goes from 2.75 seconds to 2.16
seconds.

Not a 50% improvement, though...

@++
Julien

SpeedupInactiveProfiler.patch (1K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: For loop benchmarks between Octave versions

John W. Eaton
Administrator
On 06/17/2014 09:29 AM, Julien Bect wrote:

> Le 17/06/2014 02:39, Rik a écrit :
>> I find that this changeset:
>>
>> 12920:5d18231eee00 Extend profiling support to operators.
>>
>> produces a 50% slow down when using the for loop benchmark. Unless we
>> can figure out a lower footprint way to do profiling we might be stuck
>> with this part of the slowdown.
>
> The attached patch significantly improves the situation for me.
>
> On the gui-release branch, the runtime goes from 2.75 seconds to 2.16
> seconds.
>
> Not a 50% improvement, though...

Does declaring is_active "inline" really make a difference?  I'd be
surprised if that is not always inlined when the compiler is performing
inlining optimizations.

I don't much like having to manually match up new/delete, so I'd rather
see that go inside a constructor/destructor pair.  Done properly and
with inlining, I don't see why it would be inefficient and perhaps we
could eliminate the BEGIN/END macros?

jwe


Reply | Threaded
Open this post in threaded view
|

Re: For loop benchmarks between Octave versions

Julien Bect
Le 17/06/2014 16:49, John W. Eaton a écrit :

> On 06/17/2014 09:29 AM, Julien Bect wrote:
>> Le 17/06/2014 02:39, Rik a écrit :
>>> I find that this changeset:
>>>
>>> 12920:5d18231eee00 Extend profiling support to operators.
>>>
>>> produces a 50% slow down when using the for loop benchmark. Unless we
>>> can figure out a lower footprint way to do profiling we might be stuck
>>> with this part of the slowdown.
>>
>> The attached patch significantly improves the situation for me.
>>
>> On the gui-release branch, the runtime goes from 2.75 seconds to 2.16
>> seconds.
>>
>> Not a 50% improvement, though...
>
> Does declaring is_active "inline" really make a difference?  I'd be
> surprised if that is not always inlined when the compiler is
> performing inlining optimizations.

Yes, I have tried both, and it does make a (small) difference.

The main difference was obtained by calling profiler.is_active directly
(2.75 -> 2.26, roughly).

The runtime dropped to 2.16 when I added inline.

> I don't much like having to manually match up new/delete, so I'd
> rather see that go inside a constructor/destructor pair.  Done
> properly and with inlining, I don't see why it would be inefficient
> and perhaps we could eliminate the BEGIN/END macros?

I try that right now.



Reply | Threaded
Open this post in threaded view
|

Re: For loop benchmarks between Octave versions

Daniel Sebald
On 06/17/2014 10:11 AM, Julien Bect wrote:

> Le 17/06/2014 16:49, John W. Eaton a écrit :
>> On 06/17/2014 09:29 AM, Julien Bect wrote:
>>> Le 17/06/2014 02:39, Rik a écrit :
>>>> I find that this changeset:
>>>>
>>>> 12920:5d18231eee00 Extend profiling support to operators.
>>>>
>>>> produces a 50% slow down when using the for loop benchmark. Unless we
>>>> can figure out a lower footprint way to do profiling we might be stuck
>>>> with this part of the slowdown.
>>>
>>> The attached patch significantly improves the situation for me.
>>>
>>> On the gui-release branch, the runtime goes from 2.75 seconds to 2.16
>>> seconds.
>>>
>>> Not a 50% improvement, though...
>>
>> Does declaring is_active "inline" really make a difference? I'd be
>> surprised if that is not always inlined when the compiler is
>> performing inlining optimizations.
>
> Yes, I have tried both, and it does make a (small) difference.
>
> The main difference was obtained by calling profiler.is_active directly
> (2.75 -> 2.26, roughly).
>
> The runtime dropped to 2.16 when I added inline.

I'm surprised that made a difference.  I believe that functions defined
within the class declaration are implicitly inline.  "inline" must have
additional meaning to the compiler.  For example, keeping the return
variable in a register.  Odd.


>> I don't much like having to manually match up new/delete, so I'd
>> rather see that go inside a constructor/destructor pair. Done properly
>> and with inlining, I don't see why it would be inefficient and perhaps
>> we could eliminate the BEGIN/END macros?
>
> I try that right now.

Also try this construct please:

#define END_PROFILER_BLOCK \
     if (pe) {delete (pe);} \
    }

Yes, I realize that "delete" will check that the pointer is non-null,
but I'm curious if perhaps delete is a function call rather than an
inline, and if so does an unnecessary jump.

Thanks,

Dan

Reply | Threaded
Open this post in threaded view
|

Re: For loop benchmarks between Octave versions

John W. Eaton
Administrator
In reply to this post by Julien Bect
On 06/17/2014 11:11 AM, Julien Bect wrote:
> Le 17/06/2014 16:49, John W. Eaton a écrit :

>> Does declaring is_active "inline" really make a difference?  I'd be
>> surprised if that is not always inlined when the compiler is
>> performing inlining optimizations.
>
> Yes, I have tried both, and it does make a (small) difference.
 >
> The main difference was obtained by calling profiler.is_active directly
> (2.75 -> 2.26, roughly).
>
> The runtime dropped to 2.16 when I added inline.

Consistently?

Could you look at the generated assembly code and see whether the
"inline" keyword is really needed to get this trivial function to be
inlined?  What compiler optimization flags are you using?

jwe


Reply | Threaded
Open this post in threaded view
|

Re: For loop benchmarks between Octave versions

Jordi Gutiérrez Hermoso-2
On Tue, 2014-06-17 at 11:41 -0400, John W. Eaton wrote:
> Could you look at the generated assembly code and see whether the
> "inline" keyword is really needed to get this trivial function to be
> inlined?  What compiler optimization flags are you using?

And, of course, what compiler?

- Jordi G. H.



Reply | Threaded
Open this post in threaded view
|

Re: For loop benchmarks between Octave versions

Julien Bect
In reply to this post by John W. Eaton
Le 17/06/2014 17:41, John W. Eaton a écrit :
> On 06/17/2014 11:11 AM, Julien Bect wrote:
>> The runtime dropped to 2.16 when I added inline.
> Consistently?

Well, as far as I can tell from the small number of tests that I did :
yes, consistently.

> Could you look at the generated assembly code and see whether the
> "inline" keyword is really needed to get this trivial function to be
> inlined?

I'm afraid that's beyond my skills.

> What compiler optimization flags are you using?

I use gcc 4.7.3 i686-linux-gnu on Ubuntu 13.04

I have configured with --enable-jit --disable-doc

and didn't change any other option.


Reply | Threaded
Open this post in threaded view
|

Re: For loop benchmarks between Octave versions

PrasannaKumar Muralidharan
In reply to this post by Rik-4

Benchmark using the other software will be a good case for comparison. Can anyone try and post the result?

Thanks and Regards,
Prasanna Kumar

Reply | Threaded
Open this post in threaded view
|

Re: For loop benchmarks between Octave versions

Jordi Gutiérrez Hermoso-2
On Wed, 2014-06-18 at 08:20 +0530, PrasannaKumar Muralidharan wrote:
> Benchmark using the other software will be a good case for comparison. Can
> anyone try and post the result?

You mean Matlab? (Just say its name, it's not forbidden).

A comparison with Matlab won't be very helpful. We know its JIT
compiler is order of magnitudes faster than Octave's.

- Jordi G. H.



Reply | Threaded
Open this post in threaded view
|

Re: For loop benchmarks between Octave versions

John Swensen-3

On Tue, Jun 17, 2014 at 11:34 PM, Jordi Gutiérrez Hermoso <[hidden email]> wrote:
On Wed, 2014-06-18 at 08:20 +0530, PrasannaKumar Muralidharan wrote:
> Benchmark using the other software will be a good case for comparison. Can
> anyone try and post the result?

You mean Matlab? (Just say its name, it's not forbidden).

A comparison with Matlab won't be very helpful. We know its JIT
compiler is order of magnitudes faster than Octave's.

- Jordi G. H.




Awe, come on! I kindof enjoy the Voldemort-esque allusion to Matlab on the Octave mailing lists. 
Reply | Threaded
Open this post in threaded view
|

Re: For loop benchmarks between Octave versions

Julien Bect
In reply to this post by Rik-4
Le 17/06/2014 02:39, Rik a écrit :
> I find that this changeset:
>
> 12920:5d18231eee00 Extend profiling support to operators.
>
> produces a 50% slow down when using the for loop benchmark. Unless we
> can figure out a lower footprint way to do profiling we might be stuck
> with this part of the slowdown. There is still another 50% which seems
> to be due to other accumulated cruft, i.e., I didn't find a smoking
> gun changeset.

On the gui-release branch, I notice a clear 10% increase on the for loop
benchmark between

     1ec884e5ff00   waitbar.m: Force pixel units for waitbar figure (bug
#41645)
http://hg.savannah.gnu.org/hgweb/octave/rev/1ec884e5ff00

and

     554be77a60fb   Add FT2_CFLAGS to CPPFLAGS required for new Qt graphics.
http://hg.savannah.gnu.org/hgweb/octave/rev/554be77a60fb.

All changesets inbetween are related to QtHandles, especially this one :

     18498:2e7cad6f180c   Initial integration of QtHandles.
http://hg.savannah.gnu.org/hgweb/octave/rev/2e7cad6f180c

Does anyone have a clue on why the integration of QtHandles might have
affected the computation time of a simple loop ?

(Note : all tests done with ./run-octave --no-gui on Ubuntu 13.04 with
gcc 4.7.3 i686-linux-gnu, --enable-jit)

123