segfaults building documentation when machine under load

Re: segfaults building documentation when machine under load

Daniel Sebald
On 5/19/20 9:54 PM, John W. Eaton wrote:

> On 5/19/20 4:11 PM, Dmitri A. Sergatskov wrote:
>>
>>
>> On Tue, May 19, 2020 at 4:02 PM John W. Eaton <[hidden email]
>> <mailto:[hidden email]>> wrote:
>>
>>     On 5/19/20 3:26 PM, Dmitri A. Sergatskov wrote:
>>
>>      >     Should we switch to bug-tracker?
>>      >     I was able to get a crash when I bumped the jobs to 200.
>>      >     bt is attached. The relevant part seems to be:
>>
>>     If I use a large number of jobs, I see
>>
>>         error: imwrite: invalid empty image
>>         error: called from
>>             __imwrite__ at line 40 column 5
>>             imwrite at line 125 column 5
>>             print at line 755 column 13
>>             interpimages at line 72 column 5
>>
>>     but no segfaults.
>>
>>     It does look like a threading issue.
>>
>>
>> I used a simplified test by Andreas:
>>
>> parallel -N0 -q octave --norc --silent --no-history --eval 'figure
>> (1,"visible", "off");' ::: {1..200}
>
> OK, I'm able to duplicate the problem using this method.
>
> In the stack traces I've seen, Octave is crashing inside the interpreter
> object destructor while attempting to close any remaining figures.
>
> Changing the eval above to be
>
>    figure (1, "visible", "off"); close ("all"); pause (1);
>
> eliminates the crash for me, apparently because then there are no figure
> windows to close when exiting.  But attempting the same thing in the
> doc/interpreter scripts that generate plots I see the "invalid empty
> image" error on every attempt to create a figure, at least when using a
> large number of parallel Make jobs.
>
> So clearly this kind of change is not a solution, but it may point us
> toward one.  Ultimately, we need to determine the correct sequence for
> shutting down the GUI and interpreter, including what actions can happen
> or need to be blocked, and what signals need to be sent or locks
> acquired so that there are no more races between the threads.
>
> It's a bit tricky because figures can have callbacks set to run when the
> figure is closed.  Do we expect those to run when Octave is in the
> process of exiting or is it OK to skip them?  Those functions could
> register code to run when Octave exits.  Should that be possible?  Would
> it be OK for an atexit function to display a graph?  What is reasonable
> to expect or attempt to do?

Do a "close" command and the upper-right close button follow essentially
the same path?  That is, is the sequence

1) "close"
2) call close callback
3) destroy figure object

the same in both cases?  What is the route when Octave exits?  Also, is
the graphics engine still valid in the exit scenario?  At the end of

void
gh_manager::execute_callback (const graphics_handle& h,
                               const octave_value& cb_arg,
                               const octave_value& data)
{

is the following:

       // Redraw after interacting with a user-interface (ui*) object.
       if (Vdrawnow_requested)
         {
           if (go)
             {
               std::string go_name
                 = go.get_properties ().graphics_object_name ();

               if (go_name.length () > 1
                   && go_name[0] == 'u' && go_name[1] == 'i')
                 {
                   Fdrawnow (m_interpreter);
                   Vdrawnow_requested = false;
                 }
             }
         }

Perhaps Fdrawnow() is where the failure is happening because the
graphics system was already shut down.  To debug, place a print
statement on either side of the call

printf ("About to redraw...\n");
                   Fdrawnow (m_interpreter);
printf ("...finished redraw\n");

and retry the doc build.

Dan


Re: segfaults building documentation when machine under load

John W. Eaton
Administrator
In reply to this post by Dmitri A. Sergatskov
On 5/19/20 4:11 PM, Dmitri A. Sergatskov wrote:

> I used a simplified test by Andreas:
>
> parallel -N0 -q octave --norc --silent --no-history --eval 'figure
> (1,"visible", "off");' ::: {1..200}

Thanks.

After much confusion, I think I arrived at a solution.  I pushed the
following changeset to stable and merged with default:

   http://hg.savannah.gnu.org/hgweb/octave/rev/00a9a49c7670

These most recent changes appear to improve the situation for the test
case shown above.  I'm no longer able to cause a segfault with the
following parallel execution:

     parallel -j 50 -N0 -q octave --norc --silent --no-history --eval
'figure (1, "visible", "off");' ::: {1..1000}

Here's the summary from the changeset commit message:

----
This change is a further attempt to avoid segfaults when shutting down
the interpreter and exiting the GUI event loop.  The latest approach is
to have the interpreter signal that it is finished with "normal" command
execution (REPL, command line script, or --eval option code), then let
the GUI thread process any remaining functions in its event loop(s) then
signal back to the interpreter that it is OK to shutdown.  Once the
shutdown has happened (which may involve further calls to the GUI thread
while executing atexit functions or finish.m or other shutdown code), the
interpreter signals back to the GUI that shutdown is complete.  At that
point, the GUI can delete the interpreter object and exit.
----

Before this change, the GUI could still be processing events (displaying
the figure window, for example) while the interpreter was being deleted.
Obviously, that causes trouble.
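
The three-step handshake in the commit summary can be illustrated with a small standalone sketch.  This is not the code from the changeset: it uses only the C++ standard library rather than Qt, and every name in it (phase, handshake, run_shutdown_sequence) is hypothetical.  It only shows the ordering: the interpreter reports that normal execution is done, the GUI side finishes its pending work and reports back, the interpreter runs its shutdown code, and only then does the GUI side treat the interpreter as safe to delete.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical phases of the shutdown handshake (not Octave's names).
enum class phase { running, interp_done, gui_drained, shutdown_complete };

struct handshake
{
  std::mutex mtx;
  std::condition_variable cv;
  phase state = phase::running;

  // Move to the next phase and wake up any waiting thread.
  void advance (phase next)
  {
    {
      std::lock_guard<std::mutex> lock (mtx);
      assert (next > state);   // phases only ever move forward
      state = next;
    }
    cv.notify_all ();
  }

  // Block until the handshake has reached at least the wanted phase.
  void wait_for (phase wanted)
  {
    std::unique_lock<std::mutex> lock (mtx);
    cv.wait (lock, [&] { return state >= wanted; });
  }
};

// Runs a mock interpreter thread against a mock GUI (the calling thread)
// and returns the order in which the steps happened.
std::vector<std::string> run_shutdown_sequence ()
{
  handshake hs;
  std::vector<std::string> log;
  std::mutex log_mtx;

  auto record = [&] (const std::string& msg)
  {
    std::lock_guard<std::mutex> lock (log_mtx);
    log.push_back (msg);
  };

  std::thread interp ([&] {
    record ("interpreter: normal execution finished");
    hs.advance (phase::interp_done);

    hs.wait_for (phase::gui_drained);
    record ("interpreter: running shutdown code");
    hs.advance (phase::shutdown_complete);
  });

  hs.wait_for (phase::interp_done);
  record ("gui: pending events processed");
  hs.advance (phase::gui_drained);

  hs.wait_for (phase::shutdown_complete);
  record ("gui: deleting interpreter object");

  interp.join ();
  return log;
}
```

Calling run_shutdown_sequence () returns the four steps in a fixed order, because each side blocks until the other has announced the phase it depends on.  The sketch leaves out the case where shutdown code calls back into the GUI thread.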

Although we recognized this problem before, none of the previous
solutions have really worked.  See the commit message for
https://hg.savannah.gnu.org/hgweb/octave/rev/cdb681adc85a, for example,
where I noted that

   ... the crash described in bug report #56952 appeared to be happening
when the Qt event loop was calling
QtHandles::qt_graphics_toolkit::create_object when the interpreter was
being deleted and the gh_manager object was already invalid, ...

I noticed this again and finally realized that we could probably use the
Qt event queue to ensure that pending graphics events are allowed to
finish before shutting down the interpreter.  It seems to work for all
the tests I've tried so far, including creating a figure in the finish.m
script or using "atexit ('sombrero')".
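
As a rough illustration of why routing the shutdown request through the event queue helps, here is a minimal single-threaded sketch.  It does not use Qt; event_queue and drain_then_shutdown are hypothetical names, and the queue is a plain FIFO standing in for the Qt event queue.  The point is only that a request posted last runs after every event posted before it.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical minimal FIFO event queue (a stand-in, not Qt's API).
class event_queue
{
public:

  void post (std::function<void ()> fcn)
  {
    m_events.push_back (std::move (fcn));
  }

  // Run events in posting order until none are left.  Events posted
  // while processing are handled too.
  void process_all ()
  {
    while (! m_events.empty ())
      {
        std::function<void ()> fcn = std::move (m_events.front ());
        m_events.pop_front ();
        fcn ();
      }
  }

private:

  std::deque<std::function<void ()>> m_events;
};

// Posts two graphics events followed by a shutdown request and returns
// the order in which they ran.
std::vector<std::string> drain_then_shutdown ()
{
  event_queue queue;
  std::vector<std::string> log;
  bool interpreter_alive = true;

  queue.post ([&] {
    assert (interpreter_alive);   // graphics events need the interpreter
    log.push_back ("create figure");
  });

  queue.post ([&] {
    assert (interpreter_alive);
    log.push_back ("draw figure");
  });

  // The shutdown request travels through the same queue, so it cannot
  // overtake the graphics events posted before it.
  queue.post ([&] {
    interpreter_alive = false;
    log.push_back ("shut down interpreter");
  });

  queue.process_all ();
  return log;
}
```

In this sketch the assertions inside the graphics events would fail if the shutdown request ran first, which mirrors the crash scenario where create_object ran against an interpreter that was already being deleted.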

jwe


Re: segfaults building documentation when machine under load

Daniel Sebald
On 5/22/20 4:52 PM, John W. Eaton wrote:

> I noticed this again and finally realized that we could probably use the
> Qt event queue to ensure that pending graphics events are allowed to
> finish before shutting down the interpreter.  It seems to work for all
> the tests I've tried so far, including creating a figure in the finish.m
> script or using "atexit ('sombrero')".

Some time ago a group of us looked at the problem of exiting the GUI
when the worker core is busy:

https://savannah.gnu.org/bugs/?44485

I had put some effort into a nice system whereby a QTimer waits for the
core to finish and after a certain amount of time it would signal that a
dialog box appear asking if the user wants to force an exit.  Of course,
if the core does then quit while the user hasn't answered the dialog yet
then the dialog box should disappear.  It all had to do with saving
files in the editor and closing the editor and so on.

However, I never completed the patch because I could never get the
sequencing just right.  There was always something like "What if the
user does this?", or "What if the core finishes at this point?".  This
shutdown signal might be just the thing to make it work.  I'll revisit
that bug when I can.
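
The wait-then-ask idea can be sketched without Qt.  The following is only an assumption about how the decision could look, using std::future in place of a QTimer; wait_for_core, demo_exit, and exit_action are hypothetical names and not part of the unfinished patch.

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <thread>

// What the GUI should do once the waiting period is over.
enum class exit_action { clean_exit, ask_user };

// Wait up to `timeout` for the worker core to report that it is done.
// If it does not finish in time, the caller should ask the user whether
// to force an exit.
exit_action wait_for_core (std::future<void>& core_done,
                           std::chrono::milliseconds timeout)
{
  assert (core_done.valid ());

  if (core_done.wait_for (timeout) == std::future_status::ready)
    return exit_action::clean_exit;

  return exit_action::ask_user;
}

// Mock worker that is busy for `work` before signaling completion.
exit_action demo_exit (std::chrono::milliseconds work,
                       std::chrono::milliseconds timeout)
{
  std::promise<void> done;
  std::future<void> core_done = done.get_future ();

  std::thread core ([&done, work] {
    std::this_thread::sleep_for (work);
    done.set_value ();
  });

  exit_action action = wait_for_core (core_done, timeout);

  core.join ();
  return action;
}
```

A worker that finishes within the timeout yields clean_exit; a worker that is still busy yields ask_user.  The harder part described above, dismissing the dialog if the core finishes while the question is still open, is not covered by this sketch.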

Dan


Re: segfaults building documentation when machine under load

Dmitri A. Sergatskov


On Sat, May 23, 2020 at 12:24 AM Daniel J Sebald <[hidden email]> wrote:

> Some time ago a group of us looked at the problem of exiting the GUI
> when the worker core is busy:
>
> https://savannah.gnu.org/bugs/?44485
>
> I had put some effort into a nice system whereby a QTimer waits for the
> core to finish and after a certain amount of time it would signal that a
> dialog box appear asking if the user wants to force an exit.  Of course,
> if the core does then quit while the user hasn't answered the dialog yet
> then the dialog box should disappear.  It all had to do with saving
> files in the editor and closing the editor and so on.
>
> However, I never completed the patch because I could never get the
> sequencing just right.  There was always something like "What if the
> user does this?", or "What if the core finishes at this point?".  This
> shutdown signal might be just the thing to make it work.  I'll revisit
> that bug when I can.
>
> Dan

I posted this on the bug list, but perhaps it is worth reposting here.
After John's latest patch (c6d10df71863 tip @) the segfault crash is gone.
The failed builds are due to missing files.
I tried the following test with parallel:

rm -rf /tmp/t1/*

parallel -j 32 -N0 -q ./run-octave --norc --silent --no-history --eval 'figure(1, "visible", "off"); plot (1:2); print(tempname("/tmp/t1", "t1-"));' ::: {1..128}

ls -c /tmp/t1/ | wc -l
92

I expect to have 128 files in /tmp/t1; the actual number varies from run to run. Adding pause(1) after plot and print
improves the situation, but does not solve it. Also, increasing the number of jobs seems to make it worse.
But maybe I am not using parallel correctly.

Sincerely,

Dmitri.
--


 

Re: segfaults building documentation when machine under load

mmuetzel
In reply to this post by John W. Eaton
On Friday, May 22, 2020 at 22:52, "John W. Eaton" wrote:

> I noticed this again and finally realized that we could probably use the
> Qt event queue to ensure that pending graphics events are allowed to
> finish before shutting down the interpreter.  It seems to work for all
> the tests I've tried so far, including creating a figure in the finish.m
> script or using "atexit ('sombrero')".

The Fedora buildbots seem to stop after the recent changes when they try to create the plots for the manual.
E.g.:
http://buildbot.octave.org:8010/#/builders/25/builds/1547/steps/5/logs/stdio

/bin/sh run-octave --norc --silent --no-history --path /home/buildbotu/fc25-x86_64/gcc-lto-fedora/build/../src/doc/interpreter/ --eval "geometryimages ('doc/interpreter/', 'voronoi', 'eps');"
/bin/sh run-octave --norc --silent --no-history --path /home/buildbotu/fc25-x86_64/gcc-lto-fedora/build/../src/doc/interpreter/ --eval "geometryimages ('doc/interpreter/', 'triplot', 'eps');"
/bin/sh run-octave --norc --silent --no-history --path /home/buildbotu/fc25-x86_64/gcc-lto-fedora/build/../src/doc/interpreter/ --eval "geometryimages ('doc/interpreter/', 'griddata', 'eps');"
/bin/sh run-octave --norc --silent --no-history --path /home/buildbotu/fc25-x86_64/gcc-lto-fedora/build/../src/doc/interpreter/ --eval "geometryimages ('doc/interpreter/', 'convhull', 'eps');"
command timed out: 1200 seconds without output running [b'nice', b'-n', b'10', b'make', b'V=1', b'-j4'], attempting to kill
process killed by signal 9
program finished with exit code -1
elapsedTime=2309.509019

Maybe the threads are waiting for a signal from each other and got stuck?

Are the Fedora buildbots headless like the Debian ones?

Markus




Re: segfaults building documentation when machine under load

chloros
> The Fedora buildbots seem to stop after the recent changes when they
> try to create the plots for the manual.
> E.g.:
> http://buildbot.octave.org:8010/#/builders/25/builds/1547/steps/5/logs/stdio

Hi Markus,

See my mail from yesterday. The reason is that in configure.ac,
$opengl_graphics is used before it is set, and therefore gl2ps causes
trouble when building the documentation. This came with changeset
28356:4e4baa5ac03c.

Best regards




Re: segfaults building documentation when machine under load

Dmitri A. Sergatskov
In reply to this post by mmuetzel


On Sun, May 24, 2020 at 4:10 AM "Markus Mützel" <[hidden email]> wrote:

> Are the Fedora buildbots headless like the Debian ones?

I suspect that Fedora uses the gl backend for building docs and most others use gnuplot.
It is hard to tell this from the logs.

Dmitri.
--

Re: segfaults building documentation when machine under load

mmuetzel
On May 24, 2020 at 11:43, "Dmitri A. Sergatskov" wrote:
> On Sun, May 24, 2020 at 4:10 AM "Markus Mützel" wrote:
> > Are the Fedora buildbots headless like the Debian ones?
>
> I suspect that Fedora uses the gl backend for building docs and most others use gnuplot.
> It is hard to tell this from the logs.

So could the reason why the Fedora buildbots have more issues building the documentation be that they print with the GL backend instead of gnuplot?
Maybe there are problems with that code path (that is not used by the Debian buildbots).

Markus


Re: segfaults building documentation when machine under load

Dmitri A. Sergatskov


On Sun, May 24, 2020 at 5:52 AM "Markus Mützel" <[hidden email]> wrote:
> So could the reason why the Fedora buildbots have more issues building the documentation be that they print with the GL backend instead of gnuplot?
> Maybe there are problems with that code path (that is not used by the Debian buildbots).

Yes, the problem is with the OpenGL backend. There are a couple of posts by Mike M. and myself about the issue.

Dmitri.
--

Re: segfaults building documentation when machine under load

Dmitri A. Sergatskov


On Sun, May 24, 2020 at 6:06 AM Dmitri A. Sergatskov <[hidden email]> wrote:


> Yes, the problem is with the OpenGL backend. There are a couple of posts by Mike M. and myself about the issue.

E.g. see comments 33 and 34 in

Dmitri.
--
