Replace OCTAVE_LOCAL_BUFFER implementation with std::unique_ptr?


Replace OCTAVE_LOCAL_BUFFER implementation with std::unique_ptr?

Rik-4
7/18/17

jwe,

With the goal of simplifying the code base and relying more on standard
libraries rather than hand-spun code, would it make sense to replace the
implementation of octave_local_buffer with std::unique_ptr
(http://www.cplusplus.com/reference/memory/unique_ptr/)?  The purpose of
unique_ptr is to provide automatic, scope-based cleanup of the managed
pointer: when the unique_ptr object goes out of scope, whether because the
end of a function has been reached or because an exception has been
thrown, the allocated memory is released.

Seems like it would be simple to keep the OCTAVE_LOCAL_BUFFER macro for the
moment but just change the implementation to use unique_ptr.
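
For example, with a unique_ptr to an array the cleanup happens no matter
how the scope is left (a minimal sketch; do_work is just a placeholder,
not an Octave function):

   #include <cstddef>
   #include <memory>

   void do_work (double *, std::size_t);   // placeholder for real work; may throw

   void example (std::size_t n)
   {
     std::unique_ptr<double []> buf {new double [n]};

     do_work (buf.get (), n);

     // no explicit delete[] needed: buf's destructor frees the array when
     // example returns normally and also if do_work throws
   }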

--Rik


Re: Replace OCTAVE_LOCAL_BUFFER implementation with std::unique_ptr?

John W. Eaton
Administrator
On 07/18/2017 01:41 PM, Rik wrote:

> 7/18/17
>
> jwe,
>
> With the goal of simplifying the code base and relying more on standard
> libraries rather than hand-spun code, would it make sense to replace the
> implementation of octave_local_buffer with std::unique_ptr
> (http://www.cplusplus.com/reference/memory/unique_ptr/)?  The purpose of
> unique_ptr is to provide automatic, scope-based cleanup of the managed
> pointer: when the unique_ptr object goes out of scope, whether because the
> end of a function has been reached or because an exception has been
> thrown, the allocated memory is released.
>
> Seems like it would be simple to keep the OCTAVE_LOCAL_BUFFER macro for the
> moment but just change the implementation to use unique_ptr.
>
> --Rik

I'd be in favor of doing that.  It's been discussed before, and we have
this open bug report:

   https://savannah.gnu.org/bugs/?48793

jwe



Re: Replace OCTAVE_LOCAL_BUFFER implementation with std::unique_ptr?

Rik-4
On 07/18/2017 11:30 AM, John W. Eaton wrote:
> On 07/18/2017 01:41 PM, Rik wrote:
>> 7/18/17
>>
>> jwe,
>>
>> With the goal of simplifying the code base and relying more on standard
>> libraries rather than hand-spun code, would it make sense to replace the
>> implementation of octave_local_buffer with std::unique_ptr
>> (http://www.cplusplus.com/reference/memory/unique_ptr/)?  The purpose of
>> unique_ptr is to provide automatic, scope-based cleanup of the managed
>> pointer: when the unique_ptr object goes out of scope, whether because the
>> end of a function has been reached or because an exception has been
>> thrown, the allocated memory is released.
>>
>> Seems like it would be simple to keep the OCTAVE_LOCAL_BUFFER macro for the
>> moment but just change the implementation to use unique_ptr.
>>
>> --Rik
>
> I'd be in favor of doing that.  It's been discussed before, and we have
> this open bug report:
>
>   https://savannah.gnu.org/bugs/?48793


It wasn't that hard, so I made two different implementations of OCTAVE_LOCAL_BUFFER: one based on bytes, and one based on the passed-in object type T.  Unfortunately, while they compile just fine, the resulting binary segfaults.  I think what this really means is that there are issues in Octave code around double frees of memory.  I've put my work up on the bug report you mentioned.

--Rik



Re: Replace OCTAVE_LOCAL_BUFFER implementation with std::unique_ptr?

John W. Eaton
Administrator
On 07/19/2017 11:25 AM, Rik wrote:

> It wasn't that hard so I made two different implementations of
> OCTAVE_LOCAL_BUFFER. One based on bytes, and one based on the passed in
> object type T. Unfortunately, while they compile just fine, the
> resulting binary segfaults. I think what this really means is that there
> are issues in Octave code around double frees of memory.  I've put my
> work up on the bug report you mentioned.

I also saw a crash on startup with your change.  I think you want to use
something like

#define OCTAVE_LOCAL_BUFFER(T, buf, size)                               \
   std::unique_ptr<T []> octave_local_buffer_ ## buf { new T [size] };   \
   T *buf = octave_local_buffer_ ## buf.get ()

for arrays.  I posted an updated patch to the bug tracker.  Using that
patch, I uncovered another memory error that I think is fixed with this
changeset:

   http://hg.savannah.gnu.org/hgweb/octave/rev/99a9e19bae41

With that, the test suite runs as before.
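
For reference, a use like

   OCTAVE_LOCAL_BUFFER (double, tmp, n);

then expands to roughly

   std::unique_ptr<double []> octave_local_buffer_tmp { new double [n] };
   double *tmp = octave_local_buffer_tmp.get ();

so tmp is used exactly like the old raw pointer, but the array is freed
automatically when octave_local_buffer_tmp is destroyed at the end of the
enclosing scope.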

jwe






Re: Replace OCTAVE_LOCAL_BUFFER implementation with std::unique_ptr?

Rik-4
On 07/19/2017 11:24 AM, John W. Eaton wrote:

> On 07/19/2017 11:25 AM, Rik wrote:
>
>> It wasn't that hard so I made two different implementations of
>> OCTAVE_LOCAL_BUFFER. One based on bytes, and one based on the passed in
>> object type T. Unfortunately, while they compile just fine, the
>> resulting binary segfaults. I think what this really means is that there
>> are issues in Octave code around double frees of memory.  I've put my
>> work up on the bug report you mentioned.
>
> I also saw a crash on startup with your change.  I think you want to use
> something like
>
> #define OCTAVE_LOCAL_BUFFER(T, buf, size)                               \
>   std::unique_ptr<T []> octave_local_buffer_ ## buf { new T [size] };   \
>   T *buf = octave_local_buffer_ ## buf.get ()
>
> for arrays.  I posted an updated patch to the bug tracker.  Using that
> patch, I uncovered another memory error that I think is fixed with this
> changeset:
>
>   http://hg.savannah.gnu.org/hgweb/octave/rev/99a9e19bae41
>
> With that, the test suite runs as before.
>

I tested several different initialization strategies including default,
aggregate, and value initialization and there was not much difference.  I
used default initialization in the end (what is represented by the code
above) because there didn't seem to be a need to zero-initialize POD types
in a temporary buffer.
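
For reference, the difference is just in the allocation expression used by
the macro (a quick sketch):

   new T [size]      // default-initialization: POD elements left uninitialized
   new T [size] ()   // value-initialization: POD elements zeroed
   new T [size] {}   // list-initialization: also zeroes POD elements

and the definition above uses the first form.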

Also, while looking through the code I found instances where
OCTAVE_LOCAL_BUFFER was being used with constants that are small and fixed
at compile time.  Given that OCTAVE_LOCAL_BUFFER is designed for dynamic,
and possibly large, buffers which are stored on the heap, I believe it
makes sense to replace these instances with local variables which would be
stored on the stack.  As an example,

corefcn/load-save.cc:225:  OCTAVE_LOCAL_BUFFER (unsigned char, magic, 2);

A 2-byte scratchpad just isn't worth the new/delete overhead.

Am I missing something, or should I go ahead and convert these static
buffers to local variables?

--Rik



Re: Replace OCTAVE_LOCAL_BUFFER implementation with std::unique_ptr?

John W. Eaton
Administrator
On 07/20/2017 12:26 PM, Rik wrote:

> I tested several different initialization strategies including default,
> aggregate, and value initialization and there was not much difference.  I
> used default initialization in the end (what is represented by the code
> above) because there didn't seem to be a need to zero-initialize POD types
> in a temporary buffer.

Yes, I'm pretty sure that for POD types, the assumption is that this
memory is not initialized unless the OCTAVE_LOCAL_BUFFER_INIT macro is used.
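
(For reference, the _INIT variant just fills the buffer after allocating
it, roughly

   #define OCTAVE_LOCAL_BUFFER_INIT(T, buf, size, value)    \
     OCTAVE_LOCAL_BUFFER (T, buf, size);                     \
     std::fill_n (buf, size, value)

though see oct-locbuf.h for the exact definition.)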

> Also, while looking through the code I found instances where
> OCTAVE_LOCAL_BUFFER was being used with constants that are small and fixed
> at compile time.  Given that OCTAVE_LOCAL_BUFFER is designed for dynamic,
> and possibly large, buffers which are stored on the heap, I believe it
> makes sense to replace these instances with local variables which would be
> stored on the stack.  As an example,
>
> corefcn/load-save.cc:225:  OCTAVE_LOCAL_BUFFER (unsigned char, magic, 2);
>
> A 2-byte scratchpad just isn't worth the new/delete overhead.

Yes, definitely not worth it and should be replaced with just

   "unsigned char magic[2];"

But this change should only be made in cases where the size really is a
constant.

jwe


Re: Replace OCTAVE_LOCAL_BUFFER implementation with std::unique_ptr?

Rik-4
On 07/20/2017 04:39 PM, John W. Eaton wrote:

>
>> Also, while looking through the code I found instances where
>> OCTAVE_LOCAL_BUFFER was being used with constants that are small and fixed
>> at compile time.  Given that OCTAVE_LOCAL_BUFFER is designed for dynamic,
>> and possibly large, buffers which are stored on the heap, I believe it
>> makes sense to replace these instances with local variables which would be
>> stored on the stack.  As an example,
>>
>> corefcn/load-save.cc:225:  OCTAVE_LOCAL_BUFFER (unsigned char, magic, 2);
>>
>> A 2-byte scratchpad just isn't worth the new/delete overhead.
>
> Yes, definitely not worth it and should be replaced with just
>
>   "unsigned char magic[2];"
>
> But this change should only be made in cases where the size really is a
> constant.

There were only a few instances of static buffers which I changed to use
local variables in this cset
(http://hg.savannah.gnu.org/hgweb/octave/rev/cda0614beaec).

--Rik


Re: Replace OCTAVE_LOCAL_BUFFER implementation with std::unique_ptr?

Daniel Sebald
On 07/21/2017 10:43 AM, Daniel J Sebald wrote:

> On 07/21/2017 10:17 AM, Rik wrote:
>> On 07/21/2017 01:24 AM, Daniel J Sebald wrote:
>>> On 07/20/2017 11:31 PM, Rik wrote:
>>>> On 07/20/2017 04:39 PM, John W. Eaton wrote:
>>> [snip]
>>>>> But this change should only be made in cases where the size really
>>>>> is a
>>>>> constant.
>>>>
>>>> There were only a few instances of static buffers which I changed to
>>>> use
>>>> local variables in this cset
>>>> (http://hg.savannah.gnu.org/hgweb/octave/rev/cda0614beaec).
>>>
>>> Rik, at the bottom of this changeset is
>>>
>>> +      octave_idx_type ii = 0;
>>> +      octave_idx_type jj;
>>>        for (jj = 0; jj < (nc - 8 + 1); jj += 8)
>>>          {
>>>            for (ii = 0; ii < (nr - 8 + 1); ii += 8)

[snip]

> Just curious about the use of a function pointer and call and how
> efficient that is.  For example, complex-conjugation can probably be
> done in just a few instruction cycles, whereas repeatedly placing a
> variable on the stack and jumping to a function can consume much more.
> Or does the template construct deal with all this, i.e., is the T
> *fcn() treated more like an inline?
>
> There are two different cases here,
>
>   if (nr >= 8 && nc >= 8)
>
> Wouldn't the former case still work if nr or nc were less than 8?
>
> That's an interesting motivation, to avoid cache jumping.  Wouldn't this
> idea still apply if say, nc = 3 and nr = 1000000?

This 8x8 construct has got me thinking a bit.  It's an interesting idea,
very parallel-processing oriented in a way, but I'm curious how
efficient it is.  There is still a large memory stride in this loop

               for (octave_idx_type j = jj, k = 0, idxj = jj * nr;
                    j < jj + 8; j++, idxj += nr)

if nr is extremely large.  Isn't it normally the case that memory is
packed along one dimension versus the other, e.g.,

matrix-index : memory-index
0, 0      :        0
1, 0      :        1
2, 0      :        2
...
999999, 0 :   999999
0, 1      :  1000000
1, 1      :  1000001
2, 1      :  1000002
...
etc.
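
In code terms, that packing is (a sketch; Octave stores arrays in
column-major order):

   // for an nr-by-nc matrix, element (i, j) lives at linear index
   // i + j*nr, so walking down a column touches consecutive memory
   octave_idx_type linear_index (octave_idx_type i, octave_idx_type j,
                                 octave_idx_type nr)
   {
     return i + j*nr;
   }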

In other words, we want to make sure the inner index always runs along
the direction in which the memory is packed contiguously.  So, rather
than have a block that spans 8 rows and 8 columns, don't we want a block
that spans 64 rows (or fewer) and just one column?  That way we are more
likely to stay within the cache.  Am I thinking about this correctly?

Another thought is that rather than place that function operation within
the loop (if that fcn is in fact compiled as a stack/jump/return sequence):

                   result.xelem (j + idxi) = fcn (buf[k]);

maybe it would make more sense to first copy the matrix as a transpose
and then make a call to a function which has arguments NR and NC so that
the called function can do a simple double loop through the whole array
with a register-based fcn equivalent, whether that is complex conjugate
or whatnot.
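
Roughly this shape, i.e., do the reordering by itself first and then one
dumb pass for the function (a sketch only; apply_fcn is a hypothetical
helper, not existing Octave code):

   template <typename T, typename F>
   void apply_fcn (T *out, const T *in, octave_idx_type nr,
                   octave_idx_type nc, F fcn)
   {
     // in[] already holds the transposed copy, so this loop walks both
     // arrays contiguously and the compiler can inline/vectorize fcn
     for (octave_idx_type k = 0; k < nr * nc; k++)
       out[k] = fcn (in[k]);
   }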

Dan


Re: Replace OCTAVE_LOCAL_BUFFER implementation with std::unique_ptr?

John W. Eaton
Administrator
In reply to this post by Rik-4
On 07/21/2017 12:31 AM, Rik wrote:

> There were only a few instances of static buffers which I changed to use
> local variables in this cset
> (http://hg.savannah.gnu.org/hgweb/octave/rev/cda0614beaec).

Thanks.

I admit to having to read about the meaning of this type of statement
that appears in your patch:

   int arg_used[7] {};

and I'm still a bit confused.  Is there any difference between this
syntax and

   int arg_used[7] = {};

?  They both initialize the elements of the array to 0, correct?

Should we have a preferred style for this?

jwe


Re: Replace OCTAVE_LOCAL_BUFFER implementation with std::unique_ptr?

Mike Miller-4
On Fri, Jul 21, 2017 at 12:54:14 -0400, John W. Eaton wrote:

> I admit to having to read about the meaning of this type of statement that
> appears in your patch:
>
>   int arg_used[7] {};
>
> and I'm still a bit confused.  Is there any difference between this syntax
> and
>
>   int arg_used[7] = {};
>
> ?  They both initialize the elements of the array to 0, correct?
>
> Should we have a preferred style for this?

Both are equivalent when the object being initialized is an array.

But the

    Matrix a {};

syntax also applies when the object being initialized is a class type or
a non-class type.

Personally I try to use the brace form without equals sign for both
arrays and class types now.
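
For example (a quick sketch):

    int a[7] {};      // all seven elements zero-initialized
    int b[7] = {};    // identical result for an array

    struct S { int i; double d; };
    S s {};           // brace form for a class type: members zeroed
    double x {};      // and for a non-class type: x == 0.0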

I refer to this page a lot

  http://en.cppreference.com/w/cpp/language/initialization

--
mike


Re: initialization style

Rik-4
In reply to this post by John W. Eaton
On 07/21/2017 09:54 AM, John W. Eaton wrote:

> On 07/21/2017 12:31 AM, Rik wrote:
>
>> There were only a few instances of static buffers which I changed to use
>> local variables in this cset
>> (http://hg.savannah.gnu.org/hgweb/octave/rev/cda0614beaec).
>
> Thanks.
>
> I admit to having to read about the meaning of this type of statement
> that appears in your patch:
>
>   int arg_used[7] {};
>
> and I'm still a bit confused.  Is there any difference between this
> syntax and
>
>   int arg_used[7] = {};
>
> ?  They both initialize the elements of the array to 0, correct?
>
> Should we have a preferred style for this?

Probably we should have a style.  I did some research and came back
confused enough that I simply went with the dominant convention that I saw
in other code.  However, I can say a few things.  The '=' can be important
because it will generally invoke copy-initialization while the use of
parentheses will generally invoke direct-initialization.  The use of curly
braces can mean both things.  For a non-aggregate entity it seems to do
direct-initialization, but if I read the specification correctly all
aggregate objects are always copy-initialized from the elements within the
curly braces.  For arrays, as in the example, I think the assignment
operator '=' simply re-affirms what is already happening.  And yes, there
are some differences between direct-initialization and copy-initialization
when it comes down to the universe of constructors that are considered for
matching.  For a POD type like an int, I don't think it matters, but if you
want to have a general convention then we should consider the effects of
choosing one syntax versus the other.
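
A toy example of where the difference shows up (a sketch, not Octave code):

   struct A
   {
     explicit A (int) { }
   };

   A a1 (7);      // direct-initialization: OK
   A a2 {7};      // direct-list-initialization: OK
   A a3 = 7;      // copy-initialization: error, the constructor is explicit
   A a4 = {7};    // copy-list-initialization: error for the same reason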

References:
http://en.cppreference.com/w/cpp/language/direct_initialization
http://en.cppreference.com/w/cpp/language/aggregate_initialization

--Rik



Re: Replace OCTAVE_LOCAL_BUFFER implementation with std::unique_ptr?

Daniel Sebald
In reply to this post by Daniel Sebald
On 07/21/2017 11:34 AM, Daniel J Sebald wrote:
> On 07/21/2017 10:43 AM, Daniel J Sebald wrote:

>> On 07/21/2017 10:17 AM, Rik wrote:
>>> On 07/21/2017 01:24 AM, Daniel J Sebald wrote:
>>>> On 07/20/2017 11:31 PM, Rik wrote:
>>>>> On 07/20/2017 04:39 PM, John W. Eaton wrote:
>>>> [snip]
>>>>>> But this change should only be made in cases where the size really
>>>>>> is a
>>>>>> constant.
>>>>>
>>>>> There were only a few instances of static buffers which I changed to
>>>>> use
>>>>> local variables in this cset
>>>>> (http://hg.savannah.gnu.org/hgweb/octave/rev/cda0614beaec).
>>>>
>>>> Rik, at the bottom of this changeset is
>>>>
>>>> +      octave_idx_type ii = 0;
>>>> +      octave_idx_type jj;
>>>>        for (jj = 0; jj < (nc - 8 + 1); jj += 8)
>>>>          {
>>>>            for (ii = 0; ii < (nr - 8 + 1); ii += 8)
>
> [snip]
>
>> Just curious about the use of a function pointer and call and how
>> efficient that is.  For example, complex-conjugation can probably be
>> done in just a few instruction cycles, whereas repeatedly placing a
>> variable on the stack and jumping to a function can consume much more.
>> Or does the template construct deal with all this, i.e., is the T
>> *fcn() treated more like an inline?
>>
>> There are two different cases here,
>>
>>   if (nr >= 8 && nc >= 8)
>>
>> Wouldn't the former case still work if nr or nc were less than 8?
>>
>> That's an interesting motivation, to avoid cache jumping.  Wouldn't this
>> idea still apply if say, nc = 3 and nr = 1000000?
>
> This 8x8 construct has got me thinking a bit.  It's an interesting idea,
> very parallel-processing oriented in a way, but I'm curious how
> efficient it is.  There is still a large memory stride in this loop
>
>               for (octave_idx_type j = jj, k = 0, idxj = jj * nr;
>                    j < jj + 8; j++, idxj += nr)
>
> if nr is extremely large.  Isn't it normally the case that memory is
> packed along one dimension versus the other, e.g.,
>
> matrix-index : memory-index
> 0, 0      :        0
> 1, 0      :        1
> 2, 0      :        2
> ...
> 999999, 0 :   999999
> 0, 1      :  1000000
> 1, 1      :  1000001
> 2, 1      :  1000002
> ...
> etc.
>
> In other words, we want to make sure the inner index always runs along
> the direction in which the memory is packed contiguously.  So, rather
> than have a block that spans 8 rows and 8 columns, don't we want a block
> that spans 64 rows (or fewer) and just one column?  That way we are more
> likely to stay within the cache.  Am I thinking about this correctly?
>
> Another thought is that rather than place that function operation within
> the loop (if that fcn is in fact compiled as a stack/jump/return sequence):
>
>                   result.xelem (j + idxi) = fcn (buf[k]);
>
> maybe it would make more sense to first copy the matrix as a transpose
> and then make a call to a function which has arguments NR and NC so that
> the called function can do a simple double loop through the whole array
> with a register-based fcn equivalent, whether that is complex conjugate
> or whatnot.

I tried a lot of variations on the cache-friendly block construct and
didn't find anything faster.  For example, rearranging the second loop
and incorporating it into the first loop so that no intermediate buffer
is used resulted in a slight loss of performance.

However, I did realize that the case of large NR compared to NC takes a
big hit because the last partial columns don't do blocking (it's a
separate loop), and those last columns are the ones most likely to be
suffering from cache-jumping.  So I've moved those extra loops inside
the main two loops.

The result is much more streamlined code and equally fast in all
scenarios of NR x NC:

https://savannah.gnu.org/patch/index.php?9415

Dan

