move constructors likely a requirement


move constructors likely a requirement

Rik-4
jwe, Dan,

I looked closely at the 9 instances of the ~octave_value_list destructor
being called for the simple expression "sin (1);".  I think I understand
the problematic coding pattern, but I haven't yet found a solution.

For didactic purposes, assume there is a complicated class (comp_class)
which has an extensive constructor/destructor or otherwise uses a lot of
resources.  There is also a function main which uses an instance of
comp_class, but the initialization of the instance is done by a separate
function init to make the code more modular and readable.

main ()
{
  comp_class cobj;

  cobj = init ();

  ...
}

comp_class init (void)
{
  comp_class retval;  // local variable

  retval.xxx = yyy;   // initialization
  ...

  return retval;      // return by value (a comp_class object)
}

The sequence of events for the C++ runtime is

1) local variable cobj is created in main, calling the comp_class constructor
2) init() is called
3) local variable retval is created in init, calling the comp_class constructor
4) the "return retval;" statement causes the comp_class copy constructor to
execute, copying local variable retval into a temporary comp_class object,
because the function init returns by value
5) init() completes, local variable retval goes out of scope, and the
comp_class destructor is called
6) back in main(), the copy assignment operator transfers the temporary
comp_class object into cobj, which must first dispose of whatever data cobj
already holds
7) the temporary goes out of scope and the comp_class destructor is called
8) at the end of main(), cobj goes out of scope and the comp_class destructor
is called
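
For reference, here is a minimal standalone version of the pattern above with
logging added (a hypothetical comp_class, not Octave code) that makes the
sequence observable.  With GCC or Clang you need -fno-elide-constructors to
see the full set of calls, because the compiler will otherwise elide the copy
in step 4:

#include <iostream>

// Minimal stand-in for comp_class, logging each special member function.
struct comp_class
{
  int xxx = 0;

  comp_class () { std::cerr << "default ctor\n"; }
  comp_class (const comp_class& c) : xxx (c.xxx) { std::cerr << "copy ctor\n"; }
  comp_class& operator= (const comp_class& c)
  {
    xxx = c.xxx;
    std::cerr << "copy assign\n";
    return *this;
  }
  ~comp_class () { std::cerr << "dtor\n"; }
};

comp_class init (void)
{
  comp_class retval;   // default ctor
  retval.xxx = 42;     // stand-in for the real initialization
  return retval;       // copy ctor (unless the compiler elides it)
}

int main (void)
{
  comp_class cobj;     // default ctor
  cobj = init ();      // copy assign; the temporary's dtor runs at the end of
                       // this statement
  return 0;            // dtor for cobj
}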

That is a huge number of objects created/destroyed to handle what is a
fairly simple coding pattern.  This example is not dreamt up, but reflects
the coding in pt-eval.cc.  The function
tree_evaluator::visit_index_expression is coded like so

octave_value_list first_args;
...
first_args = convert_to_const_vector (al);

where convert_to_const_vector is declared as

octave_value_list
tree_evaluator::convert_to_const_vector (tree_argument_list *arg_list,
                                           const octave_value *object)

One obvious solution would be to discard returning by value.  In that case,
we would pass a reference to the octave_value_list into the child function so
that it operates directly on the only instance we care to create.  But this
would mean changing a fair number of APIs and potentially a lot of code
refactoring.

The next most obvious thing would be to use the C++11 feature of move
constructors and move assignment operators.  I added those two functions to
ovl.h along with some logging.  Now when I run, I get:

sin(1);
octave_value_list: move constructor 893
~octave_value_list: 7772
~octave_value_list: 7773
octave_value_list: move assignment 3178
~octave_value_list: 7774
octave_value_list: move assignment 3179
~octave_value_list: 7775
octave_value_list: move assignment 3180
~octave_value_list: 7776
octave_value_list: move assignment 3181
~octave_value_list: 7777
~octave_value_list: 7778
~octave_value_list: 7779
octave_value_list: move assignment 3182
~octave_value_list: 7780

Clearly, the move routines themselves are getting called by the C++ runtime
in both flavors (constructor and assignment).  However, there is no decrease
in the number of times (9) the ~octave_value_list destructor is called,
because I couldn't find a clean way to "zero out" the rvalue object that is
being moved.  Here is the move assignment routine I have at the moment:

  octave_value_list& operator = (octave_value_list&& obj)
  {
    static long int n = 0;
    std::cerr << "octave_value_list: move assignment " << ++n << std::endl;

    if (this == &obj)
      return *this;

    std::swap (m_data, obj.m_data);
    std::swap (m_names, obj.m_names);

    return *this;
  }

Maybe this is actually okay, but it doesn't result in any speed-up.  It seems
to me that the octave_value class may also need to have move functions written
at the same time.  An octave_value is also a "complicated class", and there
are many functions which return an octave_value instance by value.  Given that
this requires creating a temporary, and the atomic increment/decrement
operators are slow, move routines in that class may be more important.

--Rik


Re: move constructors likely a requirement

Carlo de Falco-2
Hi,

> On 21 Aug 2019, at 00:59, Rik <[hidden email]> wrote:
>
> main ()
> {
>   comp_class cobj;
>
>   cobj = init ();
>
>   ...
> }
>
> comp_class init (void)
> {
>   comp_class retval;  // local variable
>
>   retval.xxx = yyy;   // initialization
>   ...
>
>   return retval;      // return by value (a comp_class object)
> }

This may be a naive comment, as it is based only on your super-simplified
example and I didn't look at the actual code, but wouldn't it be simpler
and more efficient to change the code pattern above to the following?


main ()
{
  comp_class cobj;

  init (cobj);

  ...
}

void init (comp_class& retval)
{
  //comp_class retval;  // local variable, no longer needed

  retval.xxx = yyy;   // initialization
  ...

  //return retval;      // return by value (a comp_class object), no longer needed
}


this is mainly out of curiosity, sorry for the noise if it makes no sense ...

c.

Re: move constructors likely a requirement

Carlo de Falco-2


> On 21 Aug 2019, at 07:03, Carlo De Falco <[hidden email]> wrote:
>
> Hi,
>
>> On 21 Aug 2019, at 00:59, Rik <[hidden email]> wrote:
>>
>> main ()
>> {
>>  comp_class cobj;
>>
>>  cobj = init ();
>>
>>  ...
>> }
>>
>> comp_class init (void)
>> {
>>  comp_class retval;  // local variable
>>
>>  retval.xxx = yyy;   // initialization
>>  ...
>>
>>  return retval;      // return by value (a comp_class object)
>> }
>
> This may be a naive comment, as it is based only on your super-simplified
> example and I didn't look at the actual code, but wouldn't it be simpler
> and more efficient to change the code pattern above to the following?
>
>
> main ()
> {
>  comp_class cobj;
>
>  init (cobj);
>
>  ...
> }
>
> void init (comp_class& retval)
> {
>  //comp_class retval;  // local variable, no longer needed
>
>  retval.xxx = yyy;   // initialization
>  ...
>
>  //return retval;      // return by value (a comp_class object), no longer needed
> }
>
>
> this is mainly out of curiosity, sorry for the noise if it makes no sense ...
>
> c.

Oh, I see now that you had already suggested that in your message ...

>> One obvious solution would be to discard returning by value.  In that case,
>> perhaps passing a reference to the octave_value_list in to the child
>> function so that it operates directly on the only instance we care to
>> create.  But, this would mean changing a fair number of APIs and
>> potentially a lot of code refactoring.

so this answers my question already ...

c.



Re: move constructors likely a requirement

Daniel Sebald
In reply to this post by Carlo de Falco-2
On 8/21/19 1:03 AM, Carlo De Falco wrote:

> Hi,
>
>> On 21 Aug 2019, at 00:59, Rik <[hidden email]> wrote:
>>
>> main ()
>> {
>>    comp_class cobj;
>>
>>    cobj = init ();
>>
>>    ...
>> }
>>
>> comp_class init (void)
>> {
>>    comp_class retval;  // local variable
>>
>>    retval.xxx = yyy;   // initialization
>>    ...
>>
>>    return retval;      // return by value (a comp_class object)
>> }
>
> This may be a naive comment, as it is based only on your super-simplified
> example and I didn't look at the actual code, but wouldn't it be simpler
> and more efficient to change the code pattern above to the following?
>
>
> main ()
> {
>    comp_class cobj;
>
>    init (cobj);
>
>    ...
> }
>
> void init (comp_class& retval)
> {
>    //comp_class retval;  // local variable, no longer needed
>
>    retval.xxx = yyy;   // initialization
>    ...
>
>    //return retval;      // return by value (a comp_class object), no longer needed
> }
>
>
> this is mainly out of curiosity, sorry for the noise if it makes no sense ...
>
> c.
>

It makes sense; it's one of the alternatives Rik mentioned: "return by
reference".  That would be my preference, I guess.  I spent decades
thinking in terms of pointers.  Then references came along and I was
alright with them--just a little extra sheen.  But as features have been
added to C++, I always feel a bit more detached and unsure of exactly how
efficiently something compiles.

In C and early C++ the return options were limited to basic variables,
and often there were rules about just which registers these variable
types were stored in... enabling one to be even more efficient if the
values could be kept in a register without having to go to memory/stack
and back.

Dan


Re: move constructors likely a requirement

John W. Eaton
In reply to this post by Rik-4
On 8/20/19 6:59 PM, Rik wrote:

> However, there is no
> decrease in the number of times (9) the ~octave_value_list destructor is called
> because I couldn't find a clean way to "zero out" the rvalue object that is
> being moved.

>      std::swap (m_data, obj.m_data);
>      std::swap (m_names, obj.m_names);

I think these should be

   m_data = std::move (obj.m_data);
   m_names = std::move (obj.m_names);

and not calls to std::swap.

But in any case, whether a copy or move happens, my understanding is
that the destructor for the moved object will still be called, but it
doesn't have to do anything.

For example, for a reference-counted class like octave_value, you would
do something like

   octave_value (octave_value&& other)
     : rep (other.rep)
   {
     other.rep = nullptr;
   }

to transfer the ownership of REP from the OTHER to *THIS instead of
manipulating the reference count.  Then the destructor should also be
changed so that it will skip decrementing the reference count if REP is
nullptr:

   ~octave_value (void)
   {
     if (rep && --rep->count == 0)
       delete rep;
   }

Currently, we just have "if (--rep->count == 0)" but that will fail if
we introduce a move constructor.

Instead of looking at whether the destructor is called, maybe it would
be better to trace how many times atomic variables are accessed and
modified?

I have been experimenting with the attached patches:

   diffs-1.txt: Change the string_vector class so it contains an
Array<std::string> object instead of deriving from Array<std::string>
(next step is to use some other container instead of the heavyweight
Array<T>).

   diffs-2.txt: Introduce move constructors and assignment operators for
the Array, string_vector, and octave_value_list classes.

   diffs-3.txt: Improve the tree_evaluator::convert_to_const_vector
function so that it doesn't create as many octave_value_list objects.

I think it is worth attempting to add move constructors and assignment
operators to any classes we define that need to be copied.  As we do for
copy constructors and assignment operators, it would also make sense to
use the "= default" specification where possible so that the compiler
can generate these functions, or to use "= delete" to explicitly state
that they should not be defined.
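
For reference, the spellings in question look something like this (a generic
sketch with hypothetical classes, not actual Octave declarations):

#include <string>
#include <vector>

// Explicitly defaulted "rule of five" members: the compiler generates the
// member-wise copy and move operations.
class example
{
public:
  example () = default;

  example (const example&) = default;
  example& operator= (const example&) = default;
  example (example&&) = default;
  example& operator= (example&&) = default;

  ~example () = default;

private:
  std::vector<std::string> m_data;
};

// ... and "= delete" for operations that must never be used.
class uncopyable
{
public:
  uncopyable () = default;
  uncopyable (const uncopyable&) = delete;
  uncopyable& operator= (const uncopyable&) = delete;
};

int main (void)
{
  example a, b;
  a = b;            // compiler-generated copy assignment
  a = example ();   // compiler-generated move assignment
  return 0;
}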

jwe

Attachments: diffs-1.txt (7K), diffs-2.txt (5K), diffs-3.txt (3K)

Re: move constructors likely a requirement

Rik-4
On 08/21/2019 01:23 PM, John W. Eaton wrote:

> On 8/20/19 6:59 PM, Rik wrote:
>
>> However, there is no
>> decrease in the number of times (9) the ~octave_value_list destructor is called
>> because I couldn't find a clean way to "zero out" the rvalue object that is
>> being moved.
>
>>      std::swap (m_data, obj.m_data);
>>      std::swap (m_names, obj.m_names);
>
> I think these should be
>
>   m_data = std::move (obj.m_data);
>   m_names = std::move (obj.m_names);
>
> and not calls to std::swap.

I made them std::swap because I couldn't figure out how to easily zero out
the std::vector<octave_value> that is a data member of octave_value_list.
If an instance of octave_value_list already has data in it, and move
assignment is called, then the existing data has to be dealt with in some
manner.  It was only an idea, but I swapped the data from the existing
object with the incoming data from the rvalue temporary.  This accomplished
the goal of having the incoming data supplant the existing data, and when
the rvalue object goes out of scope its destructor takes care of clearing
any data that was already present.  The move assignment operator in your
diffs is

+  octave_value_list& operator = (octave_value_list&& obj)
+  {
+    if (this != &obj)
+      {
+        m_data = std::move (obj.m_data);
+        m_names = std::move (obj.m_names);
+      }
+
+    return *this;
+  }

Wouldn't you need a call such as "m_data.clear ();" to get rid of any
existing data before assigning to it?

>
> But in any case, whether a copy or move happens, my understanding is that
> the destructor for the moved object will still be called, but it doesn't
> have to do anything.

I wrote some test code to verify this, and the destructor for the rvalue
object is always called.

> For example, for a reference-counted class like octave_value, you would
> do something like
>
>   octave_value (octave_value&& other)
>     : rep (other.rep)
>   {
>     other.rep = nullptr;
>   }
>
> to transfer the ownership of REP from the OTHER to *THIS instead of
> manipulating the reference count.  Then the destructor should also be
> changed so that it will skip decrementing the reference count if REP is
> nullptr:
>
>   ~octave_value (void)
>   {
>     if (rep && --rep->count == 0)
>       delete rep;
>   }
>
> Currently, we just have "if (--rep->count == 0)" but that will fail if we
> introduce a move constructor.
>

This is also how I understand "zeroing out" objects that use pointers.


> Instead of looking at whether the destructor is called, maybe it would be
> better to trace how many times atomic variables are accessed and modified?

I think that is what needs to be monitored.


>
> I have been experimenting with the attached patches:
>
>   diffs-1.txt: Change the string_vector class so it contains an
> Array<std::string> object instead of deriving from Array<std::string>
> (next step is to use some other container instead of the heavyweight
> Array<T>).
>
>   diffs-2.txt: Introduce move constructors and assignment operators for
> the Array, string_vector, and octave_value_list classes.
>
>   diffs-3.txt: Improve the tree_evaluator::convert_to_const_vector
> function so that it doesn't create as many octave_value_list objects.
>
> I think it is worth attempting to add move constructors and assignment
> operators to any classes we define that need to be copied.  As we do for
> copy constructors and assignment operators, it would also make sense to
> use the "= default" specification where possible so that the compiler can
> generate these functions, or to use "= delete" to explicitly state that
> they should not be defined.

This question on Stack Overflow discusses how the Rule of Three has become
more like the Rule of Five with C++11
(https://stackoverflow.com/questions/4782757/rule-of-three-becomes-rule-of-five-with-c11?noredirect=1).
If we have classes which are complex enough to require one or more
non-default constructors, then we should (as of C++11) be providing the
other special member functions, or indicating to the compiler what we wish
to have happen.

Interestingly, I think we could use '= delete' to disable move constructors
and then use the compiler to find locations in the code base that would use
a move constructor.  At that point we could decide whether to write a move
constructor for the particular class, or restructure the API for the
function so that it is passed a reference to a class object to work on.
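
Just to sketch the idea (hypothetical class; this deliberately does not
compile, which is the point -- the errors land exactly where a move would
have been chosen):

struct comp_class
{
  comp_class () = default;
  comp_class (const comp_class&) = default;
  comp_class& operator= (const comp_class&) = default;

  comp_class (comp_class&&) = delete;             // flag move constructions
  comp_class& operator= (comp_class&&) = delete;  // flag move assignments
};

comp_class init (void)
{
  comp_class r;
  return r;        // error: use of deleted move constructor
}

int main (void)
{
  comp_class c;
  c = init ();     // error: use of deleted move assignment operator
  return 0;
}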

--Rik



Re: move constructors likely a requirement

John W. Eaton
On 8/21/19 11:49 PM, Rik wrote:
> On 08/21/2019 01:23 PM, John W. Eaton wrote:

> +  octave_value_list& operator = (octave_value_list&& obj)
> +  {
> +    if (this != &obj)
> +      {
> +        m_data = std::move (obj.m_data);
> +        m_names = std::move (obj.m_names);
> +      }
> +
> +    return *this;
> +  }
>
> Wouldn't you need a call such as "m_data.clear ();" to get rid of any
> existing data before assigning to it?

From the explanations and examples I have seen, I understand "m_data =
std::move (obj.m_data);" to be the correct way to invoke the move
assignment operator for the data members.  So those operators should take
care of the move assignment semantics for these data member objects.
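
A tiny standalone check of that behavior (using a hypothetical "noisy"
element type, just to watch what happens to the target's old contents):

#include <iostream>
#include <vector>

struct noisy
{
  int v;
  explicit noisy (int v) : v (v) { }
  noisy (const noisy& other) : v (other.v) { }
  ~noisy () { std::cerr << "~noisy " << v << '\n'; }
};

int main (void)
{
  std::vector<noisy> a;
  a.emplace_back (1);   // existing contents of the target

  std::vector<noisy> b;
  b.emplace_back (2);

  a = std::move (b);    // prints "~noisy 1": the vector's move assignment
                        // destroys the old elements; no explicit clear()

  std::cerr << "a[0].v = " << a[0].v << '\n';   // prints 2
  return 0;
}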

However, the move assignment operators that I wrote for reference-counted
classes are implemented incorrectly.  To preserve copy-on-write semantics,
I think that instead of the following (simplified from Octave's Array class
by omitting slice and dimension data members):

   Array<T>& operator = (Array<T>&& a)
   {
     if (this != &a)
       {
         rep = a.rep;
         a.rep = nullptr;
       }

     return *this;
   }

it should be

   Array<T>& operator = (Array<T>&& a)
   {
     if (this != &a)
       {
         if (--rep->count == 0)
           delete rep;

         rep = a.rep;
         a.rep = nullptr;
       }

     return *this;
   }

This way, we correctly update the reference count and possibly delete the
object associated with "this" before pointing to the new rep object.  We
still eliminate one atomic increment on a reference count, but my earlier
version will result in incorrect reference counts after move assignments.
I'll try to look at fixing that and post another version of the patch
tomorrow.

>> Instead of looking at whether the destructor is called, maybe it would be
>> better to trace how many times atomic variables are accessed and modified?
>
> I think that is what needs to be monitored.

It should be fairly easy to count all operations on the atomic variables
that are used in the refcount class.  Are there any others that are not
wrapped in that class?
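
Something along these lines could do the counting -- a rough sketch with a
hypothetical counted_refcount wrapper, not the actual refcount class:

#include <atomic>
#include <cstddef>
#include <iostream>

// Instrumented reference count: every increment/decrement is tallied so the
// totals can be reported after a run.
class counted_refcount
{
public:
  explicit counted_refcount (int init) : m_count (init) { }

  int operator++ (void) { ++s_increments; return ++m_count; }
  int operator-- (void) { ++s_decrements; return --m_count; }

  int value (void) const { return m_count.load (); }

  static void report (void)
  {
    std::cerr << "refcount increments: " << s_increments
              << ", decrements: " << s_decrements << std::endl;
  }

private:
  std::atomic<int> m_count;

  static std::atomic<std::size_t> s_increments;
  static std::atomic<std::size_t> s_decrements;
};

std::atomic<std::size_t> counted_refcount::s_increments {0};
std::atomic<std::size_t> counted_refcount::s_decrements {0};

int main (void)
{
  counted_refcount rc (1);
  ++rc;
  --rc;
  --rc;
  counted_refcount::report ();   // increments: 1, decrements: 2
  return 0;
}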

> This article on Stack Overflow discusses how the Rule of 3 has now become
> more like the Rule of 5
> (https://stackoverflow.com/questions/4782757/rule-of-three-becomes-rule-of-five-with-c11?noredirect=1).
> If we have classes which are complex enough to require one or more
> non-default constructors then we should (as of C++11) be providing the
> other constructors or indicating to the compiler what we wish to have happen.

Yeah.  More powerful, more complicated.

> Interestingly, I think we could use '= delete' to disable move constructors
> and then use the compiler to find locations in the code base that would use
> a move constructor.  At that point could decide whether to write a move
> constructor for the particular class, or re-structure the API for the
> function such that it is passed a reference to a class object to work on.

OK.

jwe





Re: move constructors likely a requirement

John W. Eaton
I pushed the following changesets

   http://hg.savannah.gnu.org/hgweb/octave/rev/7335ebd4c798
   http://hg.savannah.gnu.org/hgweb/octave/rev/b00ddc40b89a

to add move constructors and move assignment operators for the
dim_vector, Array, string_vector, and octave_value classes.  For me,
these changes and the others that have been made for performance since
bug #56752 was reported drops the time to run the bm_toeplitz.m file
from about 39 seconds down to about 26 seconds.  With version 4.4.1, I
see a time of about 31 seconds.

jwe


Re: move constructors likely a requirement

Rik-4
On 08/22/2019 10:34 PM, John W. Eaton wrote:

> I pushed the following changesets
>
>   http://hg.savannah.gnu.org/hgweb/octave/rev/7335ebd4c798
>   http://hg.savannah.gnu.org/hgweb/octave/rev/b00ddc40b89a
>
> to add move constructors and move assignment operators for the
> dim_vector, Array, string_vector, and octave_value classes.  For me,
> these changes and the others that have been made for performance since
> bug #56752 was reported drop the time to run the bm_toeplitz.m file from
> about 39 seconds down to about 26 seconds.  With version 4.4.1, I see a
> time of about 31 seconds.
>

This is a good result.  I'm getting an 18% improvement on the dev branch,
from 13.3 to 10.9 seconds.  That is roughly equivalent to the historical
performance of version 4.2.1, which took 10.5 seconds.

I also tried compiling with '--disable-atomic-refcount'.  The benchmark time
then drops to 9.7 seconds, a further 11% improvement.  This suggests there
is still some room for improvement here, but that we will need to look
elsewhere for major gains.

The absolute best performer is still version 3.4.3 at 4.5 seconds which is
more than double the performance of the current result of 10.9 seconds.

--Rik


Re: move constructors likely a requirement

John W. Eaton
On 8/23/19 2:56 PM, Rik wrote:

> The absolute best performer is still version 3.4.3 at 4.5 seconds which is
> more than double the performance of the current result of 10.9 seconds.

Do we have any idea what is responsible for most of the change?  Is it
looking up functions?  Getting variable values?  Storing values in
variables?  Something else?

jwe




Re: move constructors likely a requirement

nrjank
On Fri, Aug 23, 2019 at 5:06 PM John W. Eaton <[hidden email]> wrote:

> On 8/23/19 2:56 PM, Rik wrote:
>
>> The absolute best performer is still version 3.4.3 at 4.5 seconds which is
>> more than double the performance of the current result of 10.9 seconds.
>
> Do we have any idea what is responsible for most of the change?  Is it
> looking up functions?  Getting variable values?  Storing values in
> variables?  Something else?
>
> jwe

in what version did the profiler become available?




Re: move constructors likely a requirement

nrjank
On Fri, Aug 23, 2019 at 5:16 PM Nicholas Jankowski <[hidden email]> wrote:

> On Fri, Aug 23, 2019 at 5:06 PM John W. Eaton <[hidden email]> wrote:
>
>> On 8/23/19 2:56 PM, Rik wrote:
>>
>>> The absolute best performer is still version 3.4.3 at 4.5 seconds which is
>>> more than double the performance of the current result of 10.9 seconds.
>>
>> Do we have any idea what is responsible for most of the change?  Is it
>> looking up functions?  Getting variable values?  Storing values in
>> variables?  Something else?
>
> in what version did the profiler become available?

Lazy question.  According to the FAQ [1], the profiler became available in
version 3.6.x.  Was the performance of 3.6 good enough relative to current
code to make a profiler comparison of your test worthwhile?

Re: Longitudinal Octave Performance

Rik-4
On 08/23/2019 02:20 PM, Nicholas Jankowski wrote:

> On Fri, Aug 23, 2019 at 5:16 PM Nicholas Jankowski <[hidden email]> wrote:
>
>> On Fri, Aug 23, 2019 at 5:06 PM John W. Eaton <[hidden email]> wrote:
>>
>>> On 8/23/19 2:56 PM, Rik wrote:
>>>
>>>> The absolute best performer is still version 3.4.3 at 4.5 seconds which is
>>>> more than double the performance of the current result of 10.9 seconds.
>>>
>>> Do we have any idea what is responsible for most of the change?  Is it
>>> looking up functions?  Getting variable values?  Storing values in
>>> variables?  Something else?
>>
>> in what version did the profiler become available?
>
> Lazy question.  According to the FAQ [1], the profiler became available in
> version 3.6.x.  Was the performance of 3.6 good enough relative to current
> code to make a profiler comparison of your test worthwhile?

Benchmark time (seconds) by Octave version:

                     3.2.4   3.4.3   3.6.4   3.8.2   4.0.3   4.2.1   4.4.1   5.1.0
bm.toeplitz.orig.m   5.8968  4.4943  5.0851  5.5534  10.055  10.544  13.052  13.481

The slowdown from 3.4.X to 3.6.X was about 11%.  That could have been caused by the addition of the profiler.

But, as mentioned in earlier e-mails, the critical step change was from 3.8.X to 4.0.X, where performance halved.  The two big additions for 4.0.X were a Qt GUI and classdef object-oriented programming support.  Running with --no-gui-libs or --no-gui shows less than a 2% difference from running with the GUI, so it isn't that.  It might be the classdef OO interface, but I don't know a convenient way to disable that.

--Rik

Re: move constructors likely a requirement

Rik-4
In reply to this post by John W. Eaton
On 08/23/2019 02:06 PM, John W. Eaton wrote:
> On 8/23/19 2:56 PM, Rik wrote:
>
>> The absolute best performer is still version 3.4.3 at 4.5 seconds which is
>> more than double the performance of the current result of 10.9 seconds.
>
> Do we have any idea what is responsible for most of the change?  Is it
> looking up functions?  Getting variable values?  Storing values in
> variables?  Something else?

Nothing that would amount to a 100% slowdown.

I ran the various tests you asked for earlier.  Removing OO lookup from
within the xfind routine saved about 7%.  That's interesting, but nowhere
near sufficient.  Assignment has gotten 5.6X slower between 3.2.4 and dev.
I used bm.assign.m, which is attached to bug #56752, but I reproduce it here:

--- Code bm.assign.m ---
runs = 5;

cumulate = 0; b = 0; z = 0;
for i = 1:runs
  b = zeros (620, 620);
  tic;
    for j = 1:620
      for k = 1:620
        z = 13;
      end
    end
  timing = toc;
  cumulate = cumulate + timing;
end

## Total time
cumulate
--- End Code ---

The only thing it does is assign the static value 13 to the variable z.  In
3.2.4 this took 0.17 seconds and now it takes 0.96.

I think I would start investigations with this script.  Is it the creation
of a new octave_value for 13 every time?  Is it assignment?

I'll be out for the next couple of days,
Rik



Re: move constructors likely a requirement

Lars Winterfeld-2
On 24.08.19 at 07:01, Rik wrote:

> --- Code bm.assign.m ---
> runs = 5;
>
> cumulate = 0; b = 0; z = 0;
> for i = 1:runs
>   b = zeros (620, 620);
>   tic;
>     for j = 1:620
>       for k = 1:620
>         z = 13;
>       end
>     end
>   timing = toc;
>   cumulate = cumulate + timing;
> end
>
> ## Total time
> cumulate
> --- End Code ---
>
> The only thing it does is assign the static value 13 to the variable z.  In
> 3.2.4 this took 0.17 seconds and now it takes 0.96.
>
> I think I would start investigations with this script.  Is it the creation
> of a new octave_value for 13 every time?  Is it assignment?


Hi,

maybe you do this already, but what you can do to benchmark is the following
(under Linux; have a few minutes of patience):

valgrind --tool=callgrind octave-cli -q ~/bm.assign.m
kcachegrind "$(ls -1t callgrind.out.* | head -1)"

I have Octave 4.2.2 installed (Ubuntu version, without debug symbols).
At first sight, these functions take the most time in the bm.assign.m example:

octave_lvalue::assign(octave_value::assign_op, octave_value const&)

octave::tree_evaluator::visit_simple_for_command(tree_simple_for_command&)

octave::tree_evaluator::visit_statement(tree_statement&)

tree_simple_assignment::rvalue1(int)

(Makes sense; the code mainly makes assignments.)

I understand that legacy code sometimes abuses "pass by reference" to
return a value, but I would advise against formulating it as a new
recommendation.  In modern C++, you can easily return objects (or even a
tuple of objects).  Maybe check whether you are compiling with C++14 or
higher (I believe the requirements for return value optimization were
relaxed at some point).

I haven't looked at the code, but I'd also advise using more move
semantics.  Just to spell it out, in the above example this would involve
writing a move constructor such as
visit_simple_for_command(visit_simple_for_command&&)
(and maybe a move assignment) and saying "a = std::move(b);" on the last
use of b.  Or push_back(std::move(b)), etc.

Likewise, an overload such as
octave_lvalue::assign(octave_value::assign_op, octave_value&&)
would let you write
assign(op, std::move(val));
when you no longer need val afterwards.
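
For instance, something like this generic sketch (hypothetical types, not
Octave's actual classes) -- the rvalue overload lets callers hand over the
payload instead of copying it:

#include <string>
#include <utility>
#include <vector>

// Hypothetical value type standing in for octave_value.
struct value
{
  std::string data;
};

struct lvalue
{
  std::vector<value> slots;

  // Existing copy-in overload.
  void assign (const value& v) { slots.push_back (v); }

  // Additional rvalue overload: steals the payload instead of copying it.
  void assign (value&& v) { slots.push_back (std::move (v)); }
};

int main (void)
{
  lvalue lhs;
  value tmp { std::string (1000, 'x') };

  lhs.assign (tmp);              // copies: tmp is still needed afterwards
  lhs.assign (std::move (tmp));  // last use of tmp: moves, no string copy
  return 0;
}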

Hope it helps,
Lars


Re: move constructors likely a requirement

John W. Eaton
In reply to this post by Rik-4
On 8/24/19 1:01 AM, Rik wrote:

> The only thing it does is assign the static value 13 to the variable z.  In
> 3.2.4 this took 0.17 seconds and now it takes 0.96.
>
> I think I would start investigations with this script.  Is it the creation
> of a new octave_value for 13 every time?  Is it assignment?

I spent a lot of time over the last week looking at the interpreter.
It's a bunch of different things, but the biggest factor seems to be
using the tree walker pattern instead of just doing virtual dispatch to
evaluation methods in the parse tree objects themselves.  I think I can
speed it up considerably, possibly even get back much closer to the
3.4.3 performance just by going back to something more like what we used
to have for the evaluator (but keeping evaluator state in an object
instead of as global data).  Another option is to implement a byte code
interpreter instead of directly evaluating the parse tree, but that is
more work.  OTOH, it might be instructive and helpful for JIT compiling.
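
To make the dispatch difference concrete, here is a stripped-down sketch
(hypothetical node and walker types, not Octave's actual classes):

#include <iostream>

struct constant;

struct tree_walker
{
  double result = 0;
  virtual void visit_constant (constant& c);   // virtual call #2
  virtual ~tree_walker () = default;
};

struct node
{
  virtual ~node () = default;
  virtual void accept (tree_walker& w) = 0;    // virtual call #1 (visitor route)
  virtual double evaluate (void) const = 0;    // direct-dispatch alternative
};

struct constant : node
{
  double value;
  explicit constant (double v) : value (v) { }
  void accept (tree_walker& w) override { w.visit_constant (*this); }
  double evaluate (void) const override { return value; }  // trivial work
};

void tree_walker::visit_constant (constant& c) { result = c.value; }

int main (void)
{
  constant c (13.0);

  // Visitor route: two virtual dispatches plus a round trip through the
  // walker object, just to fetch a constant.
  tree_walker w;
  c.accept (w);
  std::cout << w.result << std::endl;

  // Direct route: one virtual call on the node itself.
  std::cout << c.evaluate () << std::endl;

  return 0;
}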

I'll try to provide a more complete summary of what I found with some
patches tomorrow.

This experience does show that we need some kind of performance testing,
and not just the aggregate time required to run the test suite.  We need a
set of tests to check the performance of specific features.  But I'm not
sure how best to do comparisons.  What do we use for baseline values?

jwe



Re: move constructors likely a requirement

Rik-4
On 08/29/2019 06:43 PM, John W. Eaton wrote:

> On 8/24/19 1:01 AM, Rik wrote:
>
>> The only thing it does is assign the static value 13 to the variable z.  In
>> 3.2.4 this took 0.17 seconds and now it takes 0.96.
>>
>> I think I would start investigations with this script.  Is it the creation
>> of a new octave_value for 13 every time?  Is it assignment?
>
> I spent a lot of time over the last week looking at the interpreter. It's
> a bunch of different things, but the biggest factor seems to be using the
> tree walker pattern instead of just doing virtual dispatch to evaluation
> methods in the parse tree objects themselves.  I think I can speed it up
> considerably, possibly even get back much closer to the 3.4.3 performance
> just by going back to something more like what we used to have for the
> evaluator (but keeping evaluator state in an object instead of as global
> data).  Another option is to implement a byte code interpreter instead of
> directly evaluating the parse tree, but that is more work.  OTOH, it
> might be instructive and helpful for JIT compiling.

This sounds exciting.  I would probably proceed conservatively, i.e.,
recover past performance by reverting to the previous code pattern, before
embarking on JIT which could be a very big project indeed.

>
> I'll try to provide a more complete summary of what I found with some
> patches tomorrow.
>
> This experience does show that we need some kind of performance testing,
> and not just the aggregate time required to run the test suite.  We
> need a set of tests to check performance of specific features.  But I'm
> not sure how best to do comparisons.  What do we use for baseline values?
>

Potentially tedious, but not difficult in the same way that locating and
solving a performance issue is.  Ideally we would have tests that are
fine-grained enough to exercise one subsystem at a time, so that improvements
in one area aren't offset by losses in another and we fail to see what is
really happening.  Since you know the architecture best, I think you can
probably make a stab at the partitioning and at which subsystems those
might be.

I don't think it matters when we establish a baseline, because the results
are only needed to validate whether a proposed code change caused an
increase or decrease in performance.  If we start now, we should see run
times decrease for a while as the code is improved.  After several
iterations the size of the improvements will get smaller, until we decide
that it isn't worth working on anymore.  But right now, I suspect there is
still low-hanging fruit.

--Rik


Re: move constructors likely a requirement

Daniel Sebald
In reply to this post by John W. Eaton
On 8/29/19 9:43 PM, John W. Eaton wrote:

> On 8/24/19 1:01 AM, Rik wrote:
>
>> The only thing it does is assign the static value 13 to the variable
>> z.  In
>> 3.2.4 this took 0.17 seconds and now it takes 0.96.
>>
>> I think I would start investigations with this script.  Is it the
>> creation
>> of a new octave_value for 13 every time?  Is it assignment?
>
> I spent a lot of time over the last week looking at the interpreter.
> It's a bunch of different things, but the biggest factor seems to be
> using the tree walker pattern instead of just doing virtual dispatch to
> evaluation methods in the parse tree objects themselves.  I think I can
> speed it up considerably, possibly even get back much closer to the
> 3.4.3 performance just by going back to something more like what we used
> to have for the evaluator (but keeping evaluator state in an object
> instead of as global data).  Another option is to implement a byte code
> interpreter instead of directly evaluating the parse tree, but that is
> more work.  OTOH, it might be instructive and helpful for JIT compiling.
>
> I'll try to provide a more complete summary of what I found with some
> patches tomorrow.
>
> This experience does show that we need some kind of performance testing,
> and not just the aggregate time required to run the test suite.  We
> need a set of tests to check performance of specific features.  But I'm
> not sure how best to do comparisons.  What do we use for baseline values?

Is there some type of builtin function like

speedtest(test)

where test might be "parse", "matmult", "fft", "discio"?  Then come up
with some type of basic C++ routine that does operations resembling each
of the test types.  That is, the C++ implementation might not be general,
but it would be a baseline for the most efficient one.  So long as the
algorithms stay fixed, one can do a relative comparison against the
speedtest() result.  That *might* take the computer/CPU/bus/design speed
out of the picture.
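
For what it's worth, the C++ baseline side of that could be as simple as a
hypothetical harness like this (here timing raw scalar assignment, the same
kind of work as bm.assign.m):

#include <chrono>
#include <iostream>

int main (void)
{
  volatile double z = 0;   // volatile so the loop isn't optimized away

  auto t0 = std::chrono::steady_clock::now ();

  for (int j = 0; j < 620; j++)
    for (int k = 0; k < 620; k++)
      z = 13;

  auto t1 = std::chrono::steady_clock::now ();

  std::chrono::duration<double> elapsed = t1 - t0;
  std::cout << "baseline assignment loop: " << elapsed.count () << " s\n";
  return 0;
}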

Dan


Re: move constructors likely a requirement

ederag
In reply to this post by John W. Eaton
On Friday 30 August 2019 03:43:52 CEST John W. Eaton wrote:
> This experience does show that we need some kind of performance testing,
> and not just an aggregate time required to running the test suite.  We
> need a set of tests to check performance of specific features.  But I'm
> not sure how best to do comparisons.  What do we use for baseline values?

benchmark2 could help with micro-benchmarking:
https://savannah.gnu.org/patch/?9631




Re: move constructors likely a requirement

John W. Eaton
In reply to this post by Rik-4
On 8/30/19 12:42 AM, Rik wrote:
> On 08/29/2019 06:43 PM, John W. Eaton wrote:

>> I spent a lot of time over the last week looking at the interpreter. It's
>> a bunch of different things, but the biggest factor seems to be using the
>> tree walker pattern instead of just doing virtual dispatch to evaluation
>> methods in the parse tree objects themselves.  I think I can speed it up
>> considerably, possibly even get back much closer to the 3.4.3 performance
>> just by going back to something more like what we used to have for the
>> evaluator (but keeping evaluator state in an object instead of as global
>> data).  Another option is to implement a byte code interpreter instead of
>> directly evaluating the parse tree, but that is more work.  OTOH, it
>> might be instructive and helpful for JIT compiling.
>
> This sounds exciting.  I would probably proceed conservatively, i.e.,
> recover past performance by reverting to the previous code pattern, before
> embarking on JIT which could be a very big project indeed.

I pushed the following changes:

   http://hg.savannah.gnu.org/hgweb/octave/rev/5bf76ab4cce3
   http://hg.savannah.gnu.org/hgweb/octave/rev/a2d3fa82b730
   http://hg.savannah.gnu.org/hgweb/octave/rev/fcaecdbc8d8a

The last changeset is the most significant and reverts the changes made
for version 4.4 to use the visitor pattern for expression evaluation.
As I note in the commit message, although it is desirable to have all
parse tree evaluation functions grouped together in a single file, using
the visitor pattern can be inefficient, especially when the visitor
function is small and the extra levels of indirection and virtual
function resolution can take more time than the evaluation function
itself (evaluation of constants, for example).  That's a big reason for
the huge hit in performance when the loop contains only a simple
assignment of a constant value.

I'm not proposing a JIT compiler as an immediate fix.  My suggestion of
a byte compiler was to walk the parse tree once and emit a list of
simple instructions that could be evaluated without having to
recursively walk the parse tree again.  I suspect that would be more
efficient, but it would also require a significant amount of work.  And
if we are going this route, maybe we should be looking for an existing
virtual machine to use (JVM, even) instead of creating our own from scratch.

Another simpler thing that would provide immediate benefit would be to
improve or replace the unwind_protect objects we use to save and restore
the interpreter state.  Currently, the unwind_protect class uses a stack
to store actions to perform when the unwind_protect object goes out of
scope.  It also uses virtual functions to choose the specific type of
action to perform.  That's nice and generic, but in many cases, we just
want to save and restore a single value and the overhead is huge.  I
added some counting to the unwind_protect class and found that when the
actions are performed, we often have just a few objects to clean up or
values to restore.  Here are the stats I collected when running "make
check":

   size   instances
     9           8
     7          15
     6         101
     5     1358693
     3      736350
     2       40139
     1    12775752
     0     8170259

So in the vast majority of cases, we only have one action to perform, so
creating a stack and using virtual function dispatch doesn't make much
sense.  We could speed this up a lot just by defining a few special case
classes that save and restore single values or call a single function
(and with lambda capture, we just need a single simple interface).  I'll
take a look at doing that soon.
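
Roughly the kind of thing I mean (a sketch with hypothetical helper classes,
not a finished design):

#include <functional>
#include <iostream>
#include <utility>

// Save a single variable's value and restore it on scope exit -- no action
// stack, no virtual dispatch.
template <typename T>
class restore_value
{
public:
  explicit restore_value (T& ref) : m_ref (ref), m_saved (ref) { }
  ~restore_value (void) { m_ref = m_saved; }

  restore_value (const restore_value&) = delete;
  restore_value& operator= (const restore_value&) = delete;

private:
  T& m_ref;
  T m_saved;
};

// Run a single callable on scope exit; lambda capture covers the general
// "call this function later" case.  std::function keeps the sketch short,
// though a template parameter would avoid its overhead.
class scoped_action
{
public:
  explicit scoped_action (std::function<void ()> fn) : m_fn (std::move (fn)) { }
  ~scoped_action (void) { if (m_fn) m_fn (); }

  scoped_action (const scoped_action&) = delete;
  scoped_action& operator= (const scoped_action&) = delete;

private:
  std::function<void ()> m_fn;
};

int debug_level = 0;

int main (void)
{
  {
    restore_value<int> restore_debug (debug_level);
    scoped_action cleanup ([] { std::cout << "cleanup ran\n"; });

    debug_level = 3;
    std::cout << "inside: " << debug_level << '\n';
  }  // cleanup runs, then debug_level is restored

  std::cout << "outside: " << debug_level << '\n';   // prints 0
  return 0;
}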

jwe
