Hotspot identification

18 messages

Hotspot identification

Rik-4
Some code to inspect.  I am posting this on bug report #56752, but CC'ing
the Maintainers-List as I think it is of general interest to developers.

+verbatim+
- 43.69% octave::tree_evaluator::evaluate_internal
   - 41.90% octave::tree_evaluator::visit_index_expression
      - 14.81% octave::symbol_table::find_function
         + 14.69% octave::symbol_table::fcn_table_find
      - 11.93% octave::tree_evaluator::convert_to_const_vector
         + 8.44% octave::tree_evaluator::evaluate
           0.93% octave_value_list::octave_value_list
           0.65% octave_value_list::octave_value_list
      + 4.56% octave_builtin::call
        1.35% Array<std::__cxx11::basic_string<char,
std::char_traits<char>, std::allocator<char> > >::operator=
        1.14% Array<octave_value>::operator=
        1.04% octave_value_list::octave_value_list
        0.93% Array<std::__cxx11::basic_string<char,
std::char_traits<char>, std::allocator<char> > >::~Array
        0.85% Array<octave_value>::~Array
        0.68% octave::tree_evaluator::is_variable
        0.59% octave_value_list::operator=
-verbatim-

I'm not sure how the symbol_table is organized, but it is taking a long
time to find functions.  If we are using std::map from the STL, which is
based on an O(log n) tree lookup, then perhaps a change to
std::unordered_map, which is based on hashing with O(1) expected lookup,
would help.

There also seems to be a lot of creation, assignment, destruction of
objects.  Worst case, we have objects with heavyweight constructors. 
Fixing that would require making the constructors/destructors do less, or
introducing an entirely different lightweight object.

The second hotspot is in lvalue handling.

+verbatim+
- 19.84% octave::tree_index_expression::lvalue
   - 15.48% octave::tree_evaluator::convert_to_const_vector
      + 5.44% octave::tree_evaluator::evaluate
      + 4.60% octave_value_list::octave_value_list
        1.57% octave_value_list::octave_value_list
        0.91% std::__cxx11::_List_base<octave_value_list,
std::allocator<octave_value_list> >::_M_clear
     0.54% octave::octave_lvalue::set_index
-verbatim-

The octave_value_list class is built on the Array class
("Array<octave_value> data;"), which may, in effect, have a heavyweight
constructor.  Perhaps it would be faster to base octave_value_list on one
of the container classes in the STL.  Given that one doesn't actually need
a lot of insertion/deletion of nodes, std::vector might be a good choice.
I wouldn't go that direction, however, without more proof that this is an
issue.

Third hotspot:

+verbatim+
- 6.86% octave::octave_lvalue::assign
   - 6.64% octave_value::assign
      - 6.25% octave_value::subsasgn
         - 6.19% octave_base_matrix<NDArray>::subsasgn
            - 5.34% octave_base_value::numeric_assign
               + 3.40% oct_assignop_assign
                 0.68% __strlen_avx2
-verbatim-

--Rik


Re: Hotspot identification

John W. Eaton
Administrator
On 8/13/19 2:09 PM, Rik wrote:

> I'm not sure how the symbol_table is organized, but it is taking a long
> time to find functions.  If we are using a std::map from STL which is based
> on a log2 lookup then perhaps a change to std::unordered_map which is based
> on hashes and O(1) lookups would help.

Function lookup is slow.  We have to check for overloaded functions.
See fcn_info::fcn_info_rep::xfind in fcn-info.cc for all the things we
look for before finding a built-in function.  Is there some better way
to do it?  For example, is there a way to cache a previous lookup and
not have to search again if we know that the load path has not changed
and no new overloaded functions could possibly be found?

>
> There also seems to be a lot of creation, assignment, destruction of
> objects.  Worst case, we have objects with heavyweight constructors.
> Fixing that would require making the constructors/destructors do less, or
> introducing an entirely different lightweight object.

Yeah, octave_value_list is a weird object that should probably be
replaced by a real list, or an object based on a standard library object
instead of Octave's Array<T>.

> Second hotspot in lvalue.

I think the lvalue thing is complicated by having to support the special
meaning of "end" inside index expressions.  As I recall, there are some
weird cases we try to handle for compatibility.  Maybe there are also
better ways of doing that.

jwe


Re: Hotspot identification

Rik-4
On 08/13/2019 11:51 AM, John W. Eaton wrote:

> On 8/13/19 2:09 PM, Rik wrote:
>
>> I'm not sure how the symbol_table is organized, but it is taking a long
>> time to find functions.  If we are using a std::map from STL which is based
>> on a log2 lookup then perhaps a change to std::unordered_map which is based
>> on hashes and O(1) lookups would help.
>
> Function lookup is slow.  We have to check for overloaded functions. See
> fcn_info::fcn_info_rep::xfind in fcn-info.cc for all the things we look
> for before finding a built-in function.  Is there some better way to do
> it?  For example, is there a way to cache a previous lookup and not have
> to search again if we know that the load path has not changed and no new
> overloaded functions could possibly be found?

I think this is something useful to look at.  From the longitudinal
benchmarking, there was a step change loss of performance from the 3.8.X
series to the 4.0.X series.  Looking at the NEWS file for 4.0.0, that
release featured the addition of a Qt-based GUI and classdef OO
programming.  Running today with --no-gui-libs and --no-gui shows very
little difference, so I don't think the slowdown is because of the GUI.
That would leave classdef as a large architectural change capable of a 2X
slowdown.

I tried a quick test by commenting out the classdef section of
fcn_info::fcn_info_rep::xfind.  It was meaningful, runtime went from 14 s
to 13 s, but at ~7% it is nowhere near large enough to explain the complete
slowdown.  Commenting out the code for packages saved an additional 0.75 s,
which is nice, but also not stupendous.

--Rik


Re: Hotspot Identification

Rik-4
In reply to this post by Rik-4
Here is another possibility.  I find that octave_value_list is often taking
~1% of a particular leaf function.  If I check the annotated code I see
that the atomic locking instructions take a very long time.

--- Start Code Annotation ---
octave_value_list::octave_value_list 
/home/rik/wip/Projects_Mine/octave-dev/libgui/.libs/liboctgui.so.5.0.0
Samples│    _ZN17octave_value_listC2Ev():
       │    OCTINTERP_API
       │    octave_value_list
       │    {
       │    public:
       │
       │      octave_value_list (void)
       │      push   %rbp
    11 │      mov    %rsp,%rbp
     7 │      push   %r12
     1 │      push   %rbx
       │    _ZN17octave_value_listC1Ev():
     1 │      add    $0x10,%rax
       │    _ZN17octave_value_listC2Ev():
    15 │      mov    %rdi,%rbx
       │    _ZN17octave_value_listC1Ev():
     6 │      mov    %rax,(%rdi)
       │
       │    public:
       │
       │      static octave_idx_type dim_max (void);
       │
       │      explicit dim_vector (void) : rep (nil_rep ())
     1 │    → callq  dim_vector::nil_rep()@plt
     3 │      mov    %rax,0x8(%rbx)
       │      { OCTAVE_ATOMIC_INCREMENT (&(count ())); }
   327 │      lock   addq   $0x1,-0x10(%rax)
       │        : dimensions (), rep (nil_rep ()), slice_data (rep->data),
     1 │    → callq  Array<octave_value
       │          slice_len (rep->len)
    10 │      mov    (%rax),%rdx
     7 │      mov    %rax,0x10(%rbx)
       │      mov    %rdx,0x18(%rbx)
       │      mov    0x8(%rax),%rdx
    14 │      mov    %rdx,0x20(%rbx)
       │          return OCTAVE_ATOMIC_INCREMENT (&m_count);
       │        }
       │
       │        count_type operator++ (int)
       │        {
       │          return OCTAVE_ATOMIC_POST_INCREMENT (&m_count);
   297 │      lock   addl   $0x1,0x10(%rax)
     1 │      mov    vtable for Array<std::__cxx11::basic_string<char,
std::char_traits<char,%rax
       │      add    $0x10,%rax
    16 │      mov    %rax,0x28(%rbx)
       │      explicit dim_vector (void) : rep (nil_rep ())
       │    → callq  dim_vector::nil_rep()@plt
       │      mov    %rax,0x30(%rbx)
       │      { OCTAVE_ATOMIC_INCREMENT (&(count ())); }
   294 │      lock   addq   $0x1,-0x10(%rax)
       │        : dimensions (), rep (nil_rep ()), slice_data (rep->data),
     1 │    → callq  Array<std::__cxx11::basic_string<char,
std::char_traits<char
       │          slice_len (rep->len)

--- End Code Annotation ---

I can change the atomic instructions to ordinary ones by configuring with
--disable-atomic-refcount.  The benchmark runtime drops from 14.1 seconds
to 11.6 seconds (a 2.5 second, or ~18%, improvement), which seems important.

The requirement for atomic refcounting was introduced by communication with
the GUI.  This brings up a hard question: is there a better way to
implement cross-thread communication?

--Rik

Re: Hotspot Identification

Daniel Sebald
On 8/13/19 5:21 PM, Rik wrote:

> Here is another possibility.  I find that octave_value_list is often taking
> ~1% of an particular leaf function.  If I check the annotated code I see
> that atomic locking instructions take a very long time.
>
> [... code annotation quoted in full in the original post ...]
>
> I can change the atomic instructions to ordinary ones by configuring with
> --disable-atomic-refcount.  The benchmark runtime drops from 14.1 seconds
> to 11.6 seconds (2.5 seconds) which seems important.

Nice work Rik.  So the versions where we see a jump in consumption, are
they correlated with the introduction of the GUI?  Roughly 4.0?  That
sounds about right.


> The requirement for atomic refcounting was introduced by communication with
> the GUI.  This brings up a hard question, is there a better way to
> implement cross-thread communication?

I believe so.  Signals/slots are thread safe.  One can declare an
octave_value/octave_value_ptr as a QObject then shuffle it across a
signal/slot.  All the octave routines return an octave_value or pointer
or something.  One has to be efficient about it though.  On the GUI
side, only request the amount of information that would, say, be
visible.  E.g., just some small sub-matrix, not a whole 1000 x 1000 matrix.

Dan


Re: Hotspot Identification

John W. Eaton
Administrator
On 8/14/19 1:33 AM, Daniel J Sebald wrote:
> On 8/13/19 5:21 PM, Rik wrote:

>> The requirement for atomic refcounting was introduced by communication
>> with
>> the GUI.  This brings up a hard question, is there a better way to
>> implement cross-thread communication?
>
> I believe so.  Signals/slots are thread safe.  One can declare an
> octave_value/octave_value_ptr as a QObject then shuffle it across a
> signal/slot.

I'm not sure what you mean by octave_value_ptr.

Objects used as arguments in signal/slot functions don't have to be
QObjects.  Are you thinking of how they must be declared and registered
as meta types for Qt?

Requirements for passing objects across threads are just that they have
to have a public default constructor, a public copy constructor, and a
public destructor.  Requirements are explained here:

   https://doc.qt.io/qt-5/custom-types.html

If the object uses reference counting, then operations on the reference
count must be thread safe.  Using atomic operations for reference
counting is more efficient than using mutexes.  If you share data with a
reference counted class and need to modify the shared data in multiple
threads, you'll need to protect the sections of code that modify data.
With Octave's copy-on-write semantics for things like octave_value and
Array objects, we don't have to worry about that, because any operation
that will modify a value will first force a copy and the thread that
will modify the data will own the copy.

I think atomic reference counting is the right thing to do for these
types when they could be used across threads.

If we want to experiment with other solutions, what about introducing
move constructors and move assignment operators?  When working with a
single thread, that is most often what we want anyway, isn't it?  Can't
we define those as well as copy constructors and assignment operators
that share resources using reference counting and copy-on-write
semantics as before?  Would that help?

I can try to find or define an example class that does what I'm thinking
about in the next day or two.

jwe


Re: Hotspot Identification

Daniel Sebald
On 8/14/19 2:26 AM, John W. Eaton wrote:

> On 8/14/19 1:33 AM, Daniel J Sebald wrote:
>> On 8/13/19 5:21 PM, Rik wrote:
>
>>> The requirement for atomic refcounting was introduced by
>>> communication with
>>> the GUI.  This brings up a hard question, is there a better way to
>>> implement cross-thread communication?
>>
>> I believe so.  Signals/slots are thread safe.  One can declare an
>> octave_value/octave_value_ptr as a QObject then shuffle it across a
>> signal/slot.
>
> I'm not sure what you mean by octave_value_ptr.
>
> Objects used as arguments in signal/slot functions don't have to be
> QObjects.  Are you thinking of how they must be declared and registered
> as meta types for Qt?

Yes, declare it near the top of the object's definition; but right, it
doesn't have to be derived from a Qt object.


> Requirements for passing objects across threads are just that they have
> to have a public default constructor, a public copy constructor, and a
> public destructor.  Requirements are explained here:
>
>    https://doc.qt.io/qt-5/custom-types.html
>
> If the object uses reference counting, then operations on the reference
> count must be thread safe.  Using atomic operations for reference
> counting is more efficient than using mutexes.  If you share data with a
> reference counted class and need to modify the shared data in multiple
> threads, you'll need to protect the sections of code that modify data.
> With Octave's copy-on-write semantics for things like octave_value and
> Array objects, we don't have to worry about that, because any operation
> that will modify a value will first force a copy and the thread that
> will modify the data will own the copy.
>
> I think atomic reference counting is the right thing to do for these
> types when they could be used across threads.

I remember this.  That reference counting wasn't a safe thing across
threads; I tested it.  But when I passed the pointer to the
octave_value, it had no effect on the reference count.  With that, the
GUI can momentarily copy that octave_value data given the pointer, and
let the worker thread continue on its way after the copy is done.
Rather than a mutex, it is like suspending the worker thread for a moment.

Yes, there is some CPU wasted copying the data and holding it in the GUI
until no longer needed (not difficult to program though).  But it avoids
reference counting across threads.  It does mean the GUI needs another
level of sophistication.  It depends on what the goal is with the
GUI--to be a super-efficient realtime demo-like device?  Or something a
little simpler to program?  Not implying it couldn't have
sophisticated features, just that streaming live data from a serial port
through the system would be a lot of work.


> If we want to experiment with other solutions, what about introducing
> move constructors and move assignment operators?  When working with a
> single thread, that is most often what we want anyway, isn't it?  Can't
> we define those as well as copy constructors and assignment operators
> that share resources using reference counting and copy-on-write
> semantics as before?  Would that help?

Yes, possibly.  In fact, copying/moving is akin to what I described
above; just a little more sophisticated.  But I think the movement
should still revolve around that octave_value.  That way, if one is used
to programming Octave functions in the core, it's essentially the same
sort of knowledge in the GUI.


> I can try to find or define an example class that does what I'm thinking
> about in the next day or two.

OK, sounds good.

Dan



Re: atomic references

Rik-4
I want to write down what we know, as I understand it, so that we don't get
ahead of our skis.

First, it does objectively appear that atomic reference counting has
resulted in a performance hit.

Indirect evidence #1 is the ~2X step change in runtime from 3.8.X to 4.0.X
series which featured the introduction of classdef and the Qt GUI.

Indirect evidence #2 is this changeset which is associated with the 3.8.X
to 4.0.X transition.

changeset:   18584:89b7bd7d0b83
branch:      gui-release
parent:      18577:56209bab4213
user:        Rik <[hidden email]>
date:        Sat Mar 22 13:39:55 2014 -0700
summary:     configure.ac: Use atomic reference counting by default for Qt
toolkit.

Direct evidence #1 is the ~18% performance increase when Octave is
configured with --disable-atomic-refcount.

Given that, the next step is working out a solution.  I don't want to jump
to conclusions here either.

Question #1 : Is there a way to divide accesses between those which will be
cross-thread and those which won't?

If the answer is yes, then we might be able to save CPU cycles if there is
a limited number of objects which need to be passed across threads.

As an example, the Octave worker thread creates many temporary octave_value
objects when it is processing an expression.  None of these ever needs to
be seen by the GUI; only the final result assigned to either 'ans' or a
variable name needs to potentially be a cross-thread object.

Could the subset of octave_values to be exported be reduced to just the
ones necessary for Qt plotting?  Or does the GUI require all variables in
the symbol table to also be exported?

Question #2 : If objects must be passed cross-thread, is there a more
efficient way to do it?

jwe stated that mutexes are less efficient than what we have today with
atomic reference counting.  Is there something else we could use from the
C++ standard library?  For example, do we need to implement our own
reference-counted objects, or could we just build atop the standard
library's shared_ptr and weak_ptr?  I recognize that those didn't
exist when Octave started, but we should take advantage of what is out
there now, rather than reinventing the wheel.  Similarly, the C++ library
has quite a few facilities for threaded applications, such as the <atomic>,
<mutex>, and <future> headers.  Is there something to grab from them?
Underlying all this is my sense that we shouldn't need to roll our own code
here and that the libraries will be high performance.

--Rik


Re: move constructors

Rik-4
In reply to this post by Daniel Sebald

>> If we want to experiment with other solutions, what about introducing
>> move constructors and move assignment operators?  When working with a
>> single thread, that is most often what we want anyway, isn't it?  Can't
>> we define those as well as copy constructors and assignment operators
>> that share resources using reference counting and copy-on-write
>> semantics as before?  Would that help?
>
> Yes, possibly.  In fact, copying/moving is akin to what I described
> above; just a little more sophisticated.  But I think the movement should
> still revolve around that octave_value.  That way, if one is used to
> programming Octave functions in the core, it's essentially the same sort
> of knowledge in the GUI.
>
>
>> I can try to find or define an example class that does what I'm thinking
>> about in the next day or two.
>

This jibes with my sense from the profiling that we are doing a lot of
unnecessary creation/destruction of objects.

*IF* heavyweight constructors are a problem, then calling them less
frequently by using move constructors and move assignment operators would
definitely help.  However, that is still a hypothesis.  Is there a
relatively simple way to test this short of re-writing a bunch of classes?

One option might be to place a counter in the constructor, assignment
operator, and destructor routines for octave_value.

I should also say that I tried implementing a move constructor after
OctConf2018 and I didn't see any improvement.  But, I also could have
screwed it up and I didn't spend a long time on it.

--Rik


Re: move constructors

John W. Eaton
Administrator
On 8/14/19 1:05 PM, Rik wrote:

> This jibes with my sense from the profiling that we are doing a lot of
> unnecessary creation/destruction of objects.
>
> *IF* heavyweight constructors are a problem, then calling them less
> frequently by using move constructors and move assignment operators would
> definitely help.  However, that is still a hypothesis.  Is there a
> relatively simple way to test this short of re-writing a bunch of classes?
>
> One option might be to place a counter in the constructor, assignment
> operator, and destructor routines for octave_value.
>
> I should also say that I tried implementing a move constructor after
> OctConf2018 and I didn't see any improvement.  But, I also could have
> screwed it up and I didn't spend a long time on it.
Yeah, see the attached simplified version of our Array class.  There are
a few cases that seem (to me, at least, with my limited understanding)
like it should be possible for the compiler to figure out that it can
call the move constructor or the move assignment operator, but it doesn't.
But AFAICT, it is following the rules.  OTOH, in some cases, the
compiler may completely remove calls to the copy constructor or
assignment operator even when move constructors are not explicitly
defined.  This happens for me regardless of the optimization level
(using GCC 8.3).

If we see a large penalty for using atomic reference counting, then we
must be performing many operations where we increment and decrement the
count.  Are we doing many of those operations unnecessarily?  If so, and
given what I see with the example class, I'm wondering why more of these
operations are not eliminated.  Can we easily find where they are
happening?  At least that would be a start to understanding what should
be done.

Did you post code for your experiments with move constructors and move
assignment operators?  I don't see it in the archive of the maintainers
list, but I may not be searching for the right thing.

jwe

move.cc (6K) Download Attachment

Re: atomic references

John W. Eaton
Administrator
In reply to this post by Rik-4
On 8/14/19 12:46 PM, Rik wrote:

> I want to write down what we know, as I understand it, so that we don't get
> ahead of our skis.
>
> First, it does objectively appear that atomic reference counting has
> resulted in a performance hit.
>
> Indirect evidence #1 is the ~2X step change in runtime from 3.8.X to 4.0.X
> series which featured the introduction of classdef and the Qt GUI.
>
> Indirect evidence #2 is this changeset which is associated with the 3.8.X
> to 4.0.X transition.
>
> changeset:   18584:89b7bd7d0b83
> branch:      gui-release
> parent:      18577:56209bab4213
> user:        Rik <[hidden email]>
> date:        Sat Mar 22 13:39:55 2014 -0700
> summary:     configure.ac: Use atomic reference counting by default for Qt
> toolkit.
>
> Direct evidence #1 is the ~18% performance increase when Octave is
> configured with --disable-atomic-refcount.
>
> Given that, the next step is working out a solution.  I don't want to jump
> to conclusions here either.
>
> Question #1 : Is there a way to divide accesses between those which will be
> cross-thread and those which won't?

It seems complicated to me, but it might be possible.  I'd rather avoid
it unless we can come up with simple rules for how it works.  Otherwise
I imagine a never ending list of bugs due to misunderstanding what
objects can be passed across threads and how it is supposed to be done.

> If the answer is yes, then we might be able to save CPU cycles if there is
> a limited number of objects which need to be passed across threads.
>
> As an example, the Octave worker thread creates many temporary octave_value
> objects when it is processing an expression.  None of these ever need to be
> seen by the GUI, only the final result assigned to either 'ans' or a
> variable name needs to potentially be a cross-thread object.
>
> Could the subset of octave_values to be exported be reduced to just the
> ones necessary for Qt plotting?  Or does the GUI require all variables in
> the symbol table to also be exported?

For the workspace browser, we currently pass an object that is a summary
of the current workspace.  But it does contain info from the current
stack frame, including variable values.

> Question #2 : If objects must be passed cross-thread, is there a more
> efficient way to do it?
>
> jwe stated that mutexes are less efficient then what we have today with
> atomic reference counting.  Is there something else we could use from the
> C++ standard library?  For example, do we need to implement our own
> reference counted objects, or could we just build atop the standard
> libraries shared_ptr, weak_ptr objects?  I recognize that those didn't
> exist when Octave started, but we should take advantage of what is out
> there now, rather than rebuilding the wheel.  Similarly, the C++ libraries
> have quite a few functions for threaded applications such as the <atomic>,
> <mutex>, <future> headers.  Is there something to grab from them?
> Underlying all this is my sense that we shouldn't need to roll our own code
> here and that the libraries will be high performance.

I agree that we should use standard library features where possible.

jwe




Re: move constructors

Rik-4
In reply to this post by John W. Eaton
On 08/14/2019 11:21 AM, John W. Eaton wrote:
> On 8/14/19 1:05 PM, Rik wrote:
>
>> This jibes with my sense from the profiling that we are doing a lot of
>> unnecessary creation/destruction of objects.
>>
>> *IF* heavyweight constructors are a problem, then calling them less
>> frequently by using move constructors and move assignment operators would
>> definitely help.  However, that is still a hypothesis.  Is there a
>> relatively simple way to test this short of re-writing a bunch of classes?
>>
>> One option might be to place a counter in the constructor, assignment
>> operator, and destructor routines for octave_value.
>>
>> I should also say that I tried implementing a move constructor after
>> OctConf2018 and I didn't see any improvement.  But, I also could have
>> screwed it up and I didn't spend a long time on it.
>
> Yeah, see the attached simplified version of our Array class.  There are
> a few cases that seem (to me, at least, with my limited understanding)
> like it should be possible for the compiler to figure out that it can
> call move constructor or the move assignment operator but it doesn't.
> But AFAICT, it is following the rules.  OTOH, in some cases, the
> compiler may completely remove calls to the copy constructor or
> assignment operator even when move constructors are not explicitly
> defined.  This happens for me regardless of the optimization level
> (using GCC 8.3).
>
> If we see a large penalty for using atomic reference counting, then we
> must be performing many operations where we increment and decrement the
> count.  Are we doing many of those operations unnecessarily?  If so, and
> given what I see with the example class, I'm wondering why more of these
> operations are not eliminated.  Can we easily find where they are
> happening?  At least that would be a start to understanding what should
> be done.
Just to get an idea, I modified the destructor in Array.h to

    ~ArrayRep (void) {
        static long int n = 0;
        std::cerr << "~ArrayRep: " << ++n << "\n";
        delete [] data; }

I find that running an average statement like "x = 1;" causes 40 destructor calls.  That does seem excessive.

> > Did you post code for your experiments with move constructors and move assignment operators? I don't see it in the archive of the maintainers list, but I may not be searching for the right thing.
No, I was just playing around on my own computer.  It never amounted to anything worth publishing.

--Rik


Re: move constructors

Rik-4
In reply to this post by John W. Eaton
On 08/14/2019 02:43 PM, Rik wrote:
> On 08/14/2019 11:21 AM, John W. Eaton wrote:
>> [...]
>
> Just to get an idea, I modified the destructor in Array.h to
>
>     ~ArrayRep (void) {
>         static long int n = 0;
>         std::cerr << "~ArrayRep: " << ++n << "\n";
>         delete [] data; }
>
> I find that running an average statement like "x = 1;" causes 40
> destructor calls.  That does seem excessive.

A similar experiment with modifying the destructor for octave_value_list shows that we are creating/destroying a lot more of these than I think is necessary given the complexity of the statements.  A simple function call seems to require 9 octave_value_list objects.  An anonymous function is 21.

--Rik

:DATA:

octave:3> y = x + 1;
~octave_value_list: 7782
octave:4> y = sin (1);
~octave_value_list: 7783
~octave_value_list: 7784
~octave_value_list: 7785
~octave_value_list: 7786
~octave_value_list: 7787
~octave_value_list: 7788
~octave_value_list: 7789
~octave_value_list: 7790
~octave_value_list: 7791
octave:5> f = @(x) 2*x
f =

@(x) 2 * x
~octave_value_list: 7792
~octave_value_list: 7793
~octave_value_list: 7794

~octave_value_list: 7795
~octave_value_list: 7796
~octave_value_list: 7797
octave:6> y = f(1)
~octave_value_list: 7798
~octave_value_list: 7799
~octave_value_list: 7800
~octave_value_list: 7801
~octave_value_list: 7802
~octave_value_list: 7803
~octave_value_list: 7804
~octave_value_list: 7805
~octave_value_list: 7806
~octave_value_list: 7807
~octave_value_list: 7808
~octave_value_list: 7809
~octave_value_list: 7810
~octave_value_list: 7811
~octave_value_list: 7812
~octave_value_list: 7813
y =  2
~octave_value_list: 7814
~octave_value_list: 7815
~octave_value_list: 7816
~octave_value_list: 7817
~octave_value_list: 7818
~octave_value_list: 7819




Re: move constructors

John W. Eaton
Administrator
On 8/14/19 6:31 PM, Rik wrote:
> On 08/14/2019 02:43 PM, Rik wrote:

>> Just to get an idea, I modified the destructor in Array.h to
>>
>>     ~ArrayRep (void) {
>>         static long int n = 0;
>>         std::cerr << "~ArrayRep: " << ++n << "\n";
>>         delete [] data; }
>>
>> I find that running an average statement like "x = 1;" causes 40
>> destructor calls.  That does seem excessive.

Yeah, that does seem like a lot.  Maybe it can be reduced, but first we
have to know where they are happening and why.

> A similar experiment with modifying the destructor for octave_value_list
> shows that we are creating/destroying a lot more of these then I think
> is necessary given the complexity of the statements.  A simple function
> call seems to require 9 octave_value_list objects.

I'm sure there's a lot of room for improvement.  I traced the calls to
the destructors for the expression "sin(1);".  I found:

   * Three calls in tree_evaluator::convert_to_const_vector.  That is
the function that takes the argument list object in the parse tree and
converts it to an octave_value_list object.

   * Five calls in tree_evaluator::visit_index_expression.  Two of those
are happening at the point of the call to the sin function, so probably
inside the code that calls the function, not the visit_index_expression
function itself.

   * One more call occurs inside the tree_evaluator::evaluate function
when re-initializing the m_expr_result_value_list value.

> y =  2
> ~octave_value_list: 7814
> ~octave_value_list: 7815
> ~octave_value_list: 7816
> ~octave_value_list: 7817
> ~octave_value_list: 7818
> ~octave_value_list: 7819

Not that it's the only issue, but as far as I can tell, these are all
due to calling the disp and display functions to print the result.  If I
execute "y = 2;" to suppress printing, there are no calls to the
octave_value_list destructor.

jwe



Re: move constructors

Rik-4
On 08/15/2019 11:51 AM, John W. Eaton wrote:

> On 8/14/19 6:31 PM, Rik wrote:
>> On 08/14/2019 02:43 PM, Rik wrote:
>
>>> Just to get an idea, I modified the destructor in Array.h to
>>>
>>>     ~ArrayRep (void) {
>>>         static long int n = 0;
>>>         std::cerr << "~ArrayRep: " << ++n << "\n";
>>>         delete [] data; }
>>>
>>> I find that running an average statement like "x = 1;" causes 40
>>> destructor calls.  That does seem excessive.
>
> Yeah, that does seem like a lot.  Maybe it can be reduced, but first we
> have to know where they are happening and why.

Yes, this was mostly about confirming hunches.

>
>> A similar experiment with modifying the destructor for octave_value_list
>> shows that we are creating/destroying a lot more of these then I think
>> is necessary given the complexity of the statements.  A simple function
>> call seems to require 9 octave_value_list objects.
>
> I'm sure there's a lot of room for improvement.  I traced the calls to
> the destructors for the expression "sin(1);".  I found:
>
>   * Three calls in tree_evaluator::convert_to_const_vector.  That is the
> function that takes the argument list object in the parse tree and
> converts it to an octave_value_list object.
>
>   * Five calls in tree_evaluator::visit_index_expression.  Two of those
> are happening at the point of the call to the sin function, so probably
> inside the code that calls the function, not the visit_index_expression
> function itself.
>
>   * One more call occurs inside the tree_evaluator::evaluate function
> when re-initializing the m_expr_result_value_list value.
>

The good news, as you say, is that there is plenty of room for
optimization.  Switching the octave_value_list class to base itself atop
std::vector might yield a 25% performance improvement.  But optimizing the
design, by calling the octave_value_list constructor once rather than nine
times, could yield an 800% improvement.  We won't really achieve that, but
there are far more opportunities in optimizing the design than in
optimizing the code.

It seems the first two functions are in pt-eval.cc and the final one in
pt-eval.h.  I was confused because there is another
tree_evaluator::evaluate function in pt-eval.cc, but it makes no reference
to m_expr_result_value_list.  For reference, how did you determine which
functions were causing the calls?  A debugger and a breakpoint in
~octave_value_list()?

>> y =  2
>> ~octave_value_list: 7814
>> ~octave_value_list: 7815
>> ~octave_value_list: 7816
>> ~octave_value_list: 7817
>> ~octave_value_list: 7818
>> ~octave_value_list: 7819
>
> Not that it's the only issue, but as far as I can tell, these are all due
> to calling the disp and display functions to print the result.  If I
> execute "y = 2;" to suppress printing, there are no calls to the
> octave_value_list destructor.
>

Ah, that is comforting.  Data-processing scripts run in batch mode will
not need to print results and so can run fast.  And if a user is running
Octave interactively and inspecting results, then an extra millisecond of
delay is not noticeable.

--Rik


Re: move constructors

John W. Eaton
Administrator
On 8/15/19 4:33 PM, Rik wrote:

> For reference, how did you determine which
> functions were causing the calls?  A debugger and a breakpoint in
> ~octave_value_list()?

Yes, I used

   ./run-octave -g -cli

and let the interpreter get to the command prompt.  Then C-c to get to
the (gdb) prompt, then set a breakpoint at
octave_value_list::~octave_value_list.  Then executed something like
"sin(1);" and looked at stack traces each time the breakpoint was hit.

jwe


Re: move constructors

John W. Eaton
Administrator
On 8/15/19 5:52 PM, John W. Eaton wrote:

> On 8/15/19 4:33 PM, Rik wrote:
>
>> For reference, how did you determine which
>> functions were causing the calls?  A debugger and a breakpoint in
>> ~octave_value_list()?
>
> Yes, I used
>
>    ./run-octave -g -cli
>
> and let the interpreter get to the command prompt.  Then C-c to get to
> the (gdb) prompt, then set a breakpoint at
> octave_value_list::~octave_value_list.  Then executed something like
> "sin(1);" and looked at stack traces each time the breakpoint was hit.

Also, this was with a modified octave_value_list destructor that printed
a message.  So I also stepped through the code after the first
breakpoint was hit and watched for the status messages to be printed
from the destructor.  That is a little faster than looking at stack
traces for each call to the destructor.

jwe



Re: move constructors

Daniel Sebald
In reply to this post by Rik-4
On 8/14/19 6:31 PM, Rik wrote:

> On 08/14/2019 02:43 PM, Rik wrote:
> [...]
Right; I found the same.  In the comments of this patch

https://savannah.gnu.org/patch/?func=detailitem&item_id=8016#comment2

I basically did the same thing--printed when the reference count was
incremented and decremented and when an octave_value was destroyed.  I
did it for the dbstop command, and got a big list of such calls.

At the time, I didn't want to look into why that is.  I was interested
in watching the reference count as the octave_value info was transferred
across the thread.

Another point about using octave_value to transfer data: by doing so,
other parties could build another type of interface if desired.  I know
there are people out there who have written GUIs for Octave, open
source or not.  With a simplified interface, it doesn't matter too much
whether it is Qt on the other side of the link or not.  The only reason I
can think of for getting sophisticated with the GUI/core interaction is if
the GUI is to be super-efficient for, say, simulations or processing low-
or medium-rate data coming from a device connected to a UART or
something.  But that is a lot of difficult programming, and there aren't
the resources for such things.  I lean toward making the interface
simpler yet flexible.

I'd go with standard library arrays and maps, as well.  They are
designed to be efficient--expanding memory when needed, holding on to it
if the array size shrinks.  Same with QLists and QMaps.

Dan