segfaults building documentation when machine under load

Rik-4
I'm getting a vaguely repeatable situation where building the documentation
fails when the machine doing the work is under stress.

Example errors:

/bin/bash: line 1: 24234 Segmentation fault      (core dumped) /bin/bash
run-octave --norc --silent --no-history --path
/home/rik/wip/Projects_Mine/octave-dev/doc/interpreter/ --eval
"interpimages ('doc/interpreter/', 'interpft', 'txt');"
Makefile:27944: recipe for target 'doc/interpreter/interpft.txt' failed
make[2]: *** [doc/interpreter/interpft.txt] Error 139
make[2]: *** Waiting for unfinished jobs....
fatal: caught signal Segmentation fault -- stopping myself...
fatal: caught signal Segmentation fault -- stopping myself...
fatal: caught signal Segmentation fault -- stopping myself...
/bin/bash: line 1: 25316 Segmentation fault      (core dumped) /bin/bash
run-octave --norc --silent --no-history --path
/home/rik/wip/Projects_Mine/octave-dev/doc/interpreter/ --eval
"interpimages ('doc/interpreter/', 'interpderiv2', 'txt');"
Makefile:27950: recipe for target 'doc/interpreter/interpderiv2.txt' failed
make[2]: *** [doc/interpreter/interpderiv2.txt] Error 139
/bin/bash: line 1: 25338 Segmentation fault      (core dumped) /bin/bash
run-octave --norc --silent --no-history --path
/home/rik/wip/Projects_Mine/octave-dev/doc/interpreter/ --eval "plotimages
('doc/interpreter/', 'hist', 'txt');"
Makefile:27996: recipe for target 'doc/interpreter/hist.txt' failed
make[2]: *** [doc/interpreter/hist.txt] Error 139
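For reference, make's "Error 139" is the shell's encoding of a fatal signal: an exit status above 128 means 128 plus the signal number, and 139 - 128 = 11 = SIGSEGV. A small sketch of the decoding (the helper name is made up):

```shell
# An exit status above 128 means the child was killed by a signal
# (status - 128), so make's "Error 139" is 128 + 11 = SIGSEGV.
decode_status() {
  status=$1
  if [ "$status" -gt 128 ]; then
    echo "killed by signal $((status - 128))"
  else
    echo "exited with status $status"
  fi
}

decode_status 139   # -> killed by signal 11
decode_status 1     # -> exited with status 1
```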

Are other people experiencing this as well?  I think I saw something about
the Fedora buildbots also having this issue.

To be sure, my local machine is under stress: 6 of 8 cores are pegged, and on
top of that I am running 'make -j8' to do the build.  Also, uptime reports a
load average of 10.1.

A possible clue is that this usually happens when generating text files,
rather than when trying to generate actual images like png or pdf.  The
text is generated very quickly which means that race conditions might
become more apparent.  Is the script run-octave safe for parallel execution?

--Rik
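As a rough sanity check of the oversubscription described above, one can compare the 1-minute load average with the core count before picking a -j value; a minimal sketch, assuming a Linux /proc/loadavg:

```shell
# Compare the 1-minute load average against the core count; a load well
# above the core count (like 10.1 on 8 cores above) means the machine
# is oversubscribed.  Linux-specific: reads /proc/loadavg.
cores=$(nproc)
load=$(cut -d ' ' -f 1 /proc/loadavg)
load_int=${load%%.*}   # integer part, for the shell comparison
if [ "$load_int" -ge "$cores" ]; then
  echo "oversubscribed: load $load on $cores cores"
else
  echo "ok: load $load on $cores cores"
fi
```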



Re: segfaults building documentation when machine under load

José Abílio Matos
On Tuesday, 17 December 2019 15.55.14 WET Rik wrote:
> A possible clue is that this usually happens when generating text files,
> rather than when trying to generate actual images like png or pdf.  The
> text is generated very quickly which means that race conditions might
> become more apparent.  Is the script run-octave safe for parallel execution?
>
> --Rik

Using Fedora 31, I compile Octave on a 4-core machine (with
hyperthreading):

make -j3

Things get slow mostly at certain stages (when the linker is involved?) because
memory consumption becomes high (this machine has 8 GB of RAM).

But I never noticed the crashes that you report.

Regards,
--
José Matos




Re: segfaults building documentation when machine under load

Dmitri A. Sergatskov
In reply to this post by Rik-4
On Tue, Dec 17, 2019 at 9:56 AM Rik <[hidden email]> wrote:

>
> I'm getting a vaguely repeatable situation where building the documentation
> fails when the machine doing the work is under stress.
>
> Example errors:
>
> /bin/bash: line 1: 24234 Segmentation fault      (core dumped) /bin/bash
> run-octave --norc --silent --no-history --path
> /home/rik/wip/Projects_Mine/octave-dev/doc/interpreter/ --eval
> "interpimages ('doc/interpreter/', 'interpft', 'txt');"
> Makefile:27944: recipe for target 'doc/interpreter/interpft.txt' failed
> make[2]: *** [doc/interpreter/interpft.txt] Error 139
> make[2]: *** Waiting for unfinished jobs....
> fatal: caught signal Segmentation fault -- stopping myself...
> fatal: caught signal Segmentation fault -- stopping myself...
> fatal: caught signal Segmentation fault -- stopping myself...
> /bin/bash: line 1: 25316 Segmentation fault      (core dumped) /bin/bash
> run-octave --norc --silent --no-history --path
> /home/rik/wip/Projects_Mine/octave-dev/doc/interpreter/ --eval
> "interpimages ('doc/interpreter/', 'interpderiv2', 'txt');"
> Makefile:27950: recipe for target 'doc/interpreter/interpderiv2.txt' failed
> make[2]: *** [doc/interpreter/interpderiv2.txt] Error 139
> /bin/bash: line 1: 25338 Segmentation fault      (core dumped) /bin/bash
> run-octave --norc --silent --no-history --path
> /home/rik/wip/Projects_Mine/octave-dev/doc/interpreter/ --eval "plotimages
> ('doc/interpreter/', 'hist', 'txt');"
> Makefile:27996: recipe for target 'doc/interpreter/hist.txt' failed
> make[2]: *** [doc/interpreter/hist.txt] Error 139
>
> Are other people experiencing this as well?  I think I saw something about
> the Fedora buildbots also having this issue.
>
> To be sure, I happen to have my local machine stressed.  6 of 8 cores are
> pegged and then I am running 'make -j8' to do the build.  Also, uptime
> reports an average load of 10.1.
>
> A possible clue is that this usually happens when generating text files,
> rather than when trying to generate actual images like png or pdf.  The
> text is generated very quickly which means that race conditions might
> become more apparent.  Is the script run-octave safe for parallel execution?
>
> --Rik
>
>

yeah, we have been talking about it for years :)
Most of the Fedora buildbot failures are due to the same error. I get it on my
workstation most of the time. I get it on my laptop (running Clear Linux)
most of the time.
I do not think it is a load issue; it is more likely a race-condition /
timing issue.

Dmitri.


Re: segfaults building documentation when machine under load

Dmitri A. Sergatskov
In reply to this post by José Abílio Matos
On Tue, Dec 17, 2019 at 10:13 AM José Abílio Matos <[hidden email]> wrote:

> But I never noticed the crashes that you report.
>
> Regards,
> --
> José Matos
>

Are those incremental builds?

After you build, try:
rm -rf doc/ ; make -j4

Dmitri.


Re: segfaults building documentation when machine under load

Juan Pablo Carbajal-2
In reply to this post by Rik-4
I do not get the same error, but:

Makefile:27896: recipe for target 'doc/interpreter/voronoi.png' failed
make[2]: *** [doc/interpreter/voronoi.png] Error 1
make[2]: *** Waiting for unfinished jobs....
error: '__octave_link_enabled__' undefined near line 5, column 5
error: called from
    /home/juanpi/Devel/octave/build-default/libgui/graphics/PKG_ADD at
line 5 column 3
error: imwrite: invalid empty image
error: called from
    __imwrite__ at line 34 column 5
    imwrite at line 119 column 5
    print at line 748 column 13
    geometryimages at line 72 column 5
error: imwrite: invalid empty image
error: called from
    __imwrite__ at line 34 column 5
    imwrite at line 119 column 5
    print at line 748 column 13
    geometryimages at line 79 column 5

But running it a couple of times solves the issue.
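Since the failure is transient, a blunt workaround is to retry the failing step; a sketch (the retry helper and the make invocation are illustrative, not part of the build system):

```shell
# Run a command up to N times, stopping at the first success; useful
# when a failure is a transient race rather than a real error.
retry() {
  attempts=$1; shift
  i=1
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0
    echo "attempt $i of $attempts failed; retrying..." >&2
    i=$((i + 1))
  done
  return 1
}

# Illustrative usage against the doc build:
#   retry 3 make -j8
```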


Re: segfaults building documentation when machine under load

José Abílio Matos
In reply to this post by Dmitri A. Sergatskov
On Tuesday, 17 December 2019 16.25.34 WET Dmitri A. Sergatskov wrote:
> Are those incremental builds?

Yes.
 
> After you built try
> rm -rf doc/ ; make -j4
>
> Dmitri.

I tried it now, but it worked. :-)

But now that Juan Pablo mentions a failure in voronoi.png, I remember getting
one of those in some random compilation.

Since it succeeded the next time, I ignored it and never reported it;
since I do incremental builds, it could have been a bad transient state.

So it seems that I also get the problem, but so rarely that I had forgotten about it. :-)
--
José Matos




Re: segfaults building documentation when machine under load

Juan Pablo Carbajal-2
> But now that Juan Pablo mentioned a failure in voronoi.png I remember to got
> one of those at some random compilation.

The error is triggered randomly by almost all .png targets, and only when I
use more than one job.
Doing "rm -rf doc/" before compiling doesn't seem to prevent the error.
Here is another case

  GEN      doc/interpreter/interpderiv1.png
error: imwrite: invalid empty image
error: called from
    __imwrite__ at line 34 column 5
    imwrite at line 119 column 5
    print at line 748 column 13
    geometryimages at line 99 column 5
Makefile:27906: recipe for target 'doc/interpreter/inpolygon.png' failed
make[2]: *** [doc/interpreter/inpolygon.png] Error 1
make[2]: *** Waiting for unfinished jobs....
error: imwrite: invalid empty image
error: called from
    __imwrite__ at line 34 column 5
    imwrite at line 119 column 5
    print at line 748 column 13
    interpimages at line 54 column 5
Makefile:27938: recipe for target 'doc/interpreter/interpn.png' failed
make[2]: *** [doc/interpreter/interpn.png] Error 1
make[2]: Leaving directory '/home/juanpi/Devel/octave/builds/default'
Makefile:26374: recipe for target 'all-recursive' failed
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory '/home/juanpi/Devel/octave/builds/default'
Makefile:9958: recipe for target 'all' failed
make: *** [all] Error 2

From the error message, I guess the problem is one thread still
writing the image while another tries to use it.


Re: segfaults building documentation when machine under load

Andreas Weber-6
In reply to this post by Rik-4
Am 17.12.19 um 16:55 schrieb Rik:
> I'm getting a vaguely repeatable situation where building the documentation
> fails when the machine doing the work is under stress.

I can reproduce this with any "txt" output using GNU parallel, for
example (from the build directory):

parallel -N0 -q ./run-octave --norc --silent --no-history --path
../octave-src/doc/interpreter/ --eval "plotimages ('doc/interpreter/',
'hist', 'txt');" ::: {1..50}

This generates segfaults with plotimages, sparseimages, splineimages and
so on...

I'll build now with debugging symbols.

hg id be3dab3212e9 on debian buster
-- Andy


Re: segfaults building documentation when machine under load

Andreas Weber-6
Am 19.05.20 um 11:04 schrieb Andreas Weber:
> I can reproduce this with any "txt" output using GNU parallel, for
> example (from the build directory):
>
> parallel -N0 -q ./run-octave --norc --silent --no-history --path
> ../octave-src/doc/interpreter/ --eval "plotimages ('doc/interpreter/',
> 'hist', 'txt');" ::: {1..50}
>
> This generates segfaults with plotimages, sparseimages, splineimages and
> so on...

I'm not able to reproduce it with debugging symbols... but here is gdb
without symbols:

Thread 1 "octave-gui" received signal SIGSEGV, Segmentation fault.
0x00007ffff3ccb592 in QMetaObject::invokeMethod(QObject*, char const*,
Qt::ConnectionType, QGenericReturnArgument, QGenericArgument,
QGenericArgument, QGenericArgument, QGenericArgument, QGenericArgument,
QGenericArgument, QGenericArgument, QGenericArgument, QGenericArgument,
QGenericArgument) ()
   from /usr/lib/x86_64-linux-gnu/libQt5Core.so.5

(gdb) bt
#0  0x00007ffff3ccb592 in QMetaObject::invokeMethod(QObject*, char
const*, Qt::ConnectionType, QGenericReturnArgument, QGenericArgument,
QGenericArgument, QGenericArgument, QGenericArgument, QGenericArgument,
QGenericArgument, QGenericArgument, QGenericArgument, QGenericArgument,
QGenericArgument) ()
    at /usr/lib/x86_64-linux-gnu/libQt5Core.so.5
#1  0x00007ffff7d59320 in QMetaObject::invokeMethod(QObject*, char
const*, Qt::ConnectionType, QGenericArgument, QGenericArgument,
QGenericArgument, QGenericArgument, QGenericArgument, QGenericArgument,
QGenericArgument, QGenericArgument, QGenericArgument, QGenericArgument)
    (val9=..., val8=..., val7=..., val6=..., val5=..., val4=...,
val3=..., val2=..., val1=..., val0=..., type=<optimized out>,
member=0x7ffff7e55d8b "slotFinalize", obj=<optimized out>) at
/usr/include/x86_64-linux-gnu/qt5/QtCore/qobjectdefs.h:444
#2  0x00007ffff7d59320 in QtHandles::ObjectProxy::finalize()
(this=0x7fffbc4acaf0) at ../octave-src/libgui/graphics/ObjectProxy.cc:110
#3  0x00007ffff7d59320 in QtHandles::ObjectProxy::finalize()
(this=0x7fffbc4acaf0) at ../octave-src/libgui/graphics/ObjectProxy.cc:100
#4  0x00007ffff7d59351 in
QtHandles::ObjectProxy::setObject(QtHandles::Object*)
(this=this@entry=0x7fffbc4acaf0, obj=obj@entry=0x5555557e48b0)
    at ../octave-src/libgui/graphics/ObjectProxy.cc:86
#5  0x00007ffff7d8a795 in
QtHandles::qt_graphics_toolkit::create_object(double)
(this=0x7fffbc16b850, handle=-31.839112234376785)
    at ../octave-src/libgui/graphics/qt-graphics-toolkit.cc:452
#6  0x00007ffff3ce5072 in QObject::event(QEvent*) () at
/usr/lib/x86_64-linux-gnu/libQt5Core.so.5
#7  0x00007ffff46384c1 in QApplicationPrivate::notify_helper(QObject*,
QEvent*) () at /usr/lib/x86_64-linux-gnu/libQt5Widgets.so.5
#8  0x00007ffff463f970 in QApplication::notify(QObject*, QEvent*) () at
/usr/lib/x86_64-linux-gnu/libQt5Widgets.so.5
#9  0x00007ffff7de0723 in octave::octave_qapplication::notify(QObject*,
QEvent*) (this=0x5555556434a0, receiver=<optimized out>, ev=<optimized out>)
    at ../octave-src/libgui/src/octave-qobject.cc:132
#10 0x00007ffff3cbb489 in QCoreApplication::notifyInternal2(QObject*,
QEvent*) () at /usr/lib/x86_64-linux-gnu/libQt5Core.so.5
#11 0x00007ffff3cbe46b in
QCoreApplicationPrivate::sendPostedEvents(QObject*, int, QThreadData*)
() at /usr/lib/x86_64-linux-gnu/libQt5Core.so.5
#12 0x00007ffff3d0d103 in  () at /usr/lib/x86_64-linux-gnu/libQt5Core.so.5
#13 0x00007ffff10ddf2e in g_main_context_dispatch () at
/usr/lib/x86_64-linux-gnu/libglib-2.0.so.0
#14 0x00007ffff10de1c8 in  () at /usr/lib/x86_64-linux-gnu/libglib-2.0.so.0
#15 0x00007ffff10de25c in g_main_context_iteration () at
/usr/lib/x86_64-linux-gnu/libglib-2.0.so.0
#16 0x00007ffff3d0c727 in
QEventDispatcherGlib::processEvents(QFlags<QEventLoop::ProcessEventsFlag>)
() at /usr/lib/x86_64-linux-gnu/libQt5Core.so.5
#17 0x00007fffcedc4401 in  () at /usr/lib/x86_64-linux-gnu/libQt5XcbQpa.so.5
#18 0x00007ffff3cba15b in
QEventLoop::exec(QFlags<QEventLoop::ProcessEventsFlag>) () at
/usr/lib/x86_64-linux-gnu/libQt5Core.so.5
#19 0x00007ffff3cc2132 in QCoreApplication::exec() () at
/usr/lib/x86_64-linux-gnu/libQt5Core.so.5
#20 0x00007ffff7dea17d in octave::qt_application::execute()
(this=this@entry=0x7fffffffc4c0) at
../octave-src/libgui/src/qt-application.cc:73
#21 0x0000555555555396 in main(int, char**) (argc=15,
argv=0x7fffffffc7d8) at ../octave-src/src/main-gui.cc:106


Re: segfaults building documentation when machine under load

Dmitri A. Sergatskov


On Tue, May 19, 2020 at 7:07 AM Andreas Weber <[hidden email]> wrote:
I'm not able to reproduce it with debugging symbols... but here is gdb
without symbols:


How do you run gdb with parallel?

Dmitri.

Re: segfaults building documentation when machine under load

John W. Eaton
On 5/19/20 7:19 AM, Dmitri A. Sergatskov wrote:

>
>
> On Tue, May 19, 2020 at 7:07 AM Andreas Weber <[hidden email]
> <mailto:[hidden email]>> wrote:
>
>     I'm not able to reproduce it with debugging symbols... but here is gdb
>     without symbols:
>
>
> How do you run gdb with parallel?

I would enable core files (ulimit -c unlimited in bash) and run gdb on the
core file resulting from the crash.

Could you also try

   (gdb) thread apply all bt

to see whether that gives additional clues about the location and cause
of the crash?

jwe
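jwe's suggestion, spelled out as a sketch (the binary and core-file paths are assumptions; actual core file naming depends on the system's core_pattern):

```shell
# Allow core dumps in this shell and its children.
ulimit -c unlimited

# After a crash, run gdb non-interactively over the core file to get
# backtraces for every thread (paths below are assumptions).
if [ -f core ]; then
  gdb -batch -ex "thread apply all bt" ./src/octave-gui core
fi
```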


Re: segfaults building documentation when machine under load

Dmitri A. Sergatskov


On Tue, May 19, 2020 at 8:44 AM John W. Eaton <[hidden email]> wrote:
On 5/19/20 7:19 AM, Dmitri A. Sergatskov wrote:
>
>
> On Tue, May 19, 2020 at 7:07 AM Andreas Weber <[hidden email]
> <mailto:[hidden email]>> wrote:
>
>     I'm not able to reproduce it with debugging symbols... but here is gdb
>     without symbols:
>
>
> How do you run gdb with parallel?

I would enable core files (ulimit -c unlimited in bash) and run gdb on the
core file resulting from the crash.

Could you also try

   (gdb) thread apply all bt

to see whether that gives additional clues about the location and cause
of the crash?


I cannot reproduce the crash with parallel. I am still getting some kind of crash while building doc/.
With default compiler flags I get a coredump that appears completely useless:

Stack trace of thread 3125:
                                              #0  0x000000000a70756f n/a (n/a)
                                              #1  0x00007fa14eb3c9ae n/a (/home/dima/src/octave/gcc_def/libgui/.libs/liboctgui.so.6.0.0)
                                              #2  0x00007ffd98e0b660 n/a (n/a)

With "-O0 -ggdb3" no crash.
With "-O1 -ggdb3" or "-O2 -ggdb3" I do not get a coredump. The crash is something like:

/bin/sh run-octave --norc --silent --no-history --path /home/dima/src/octave/gcc_debug/../doc/interpreter/ --eval "splineimages ('doc/interpreter/', 'splinefit3', 'png');"
error: imwrite: invalid empty image
error: called from
    __imwrite__ at line 40 column 5
    imwrite at line 125 column 5
    print at line 755 column 13
    sparseimages at line 53 column 5
/bin/sh run-octave --norc --silent --no-history --path /home/dima/src/octave/gcc_debug/../doc/interpreter/ --eval "splineimages ('doc/interpreter/', 'splinefit4', 'png');"
/bin/sh run-octave --norc --silent --no-history --path /home/dima/src/octave/gcc_debug/../doc/interpreter/ --eval "splineimages ('doc/interpreter/', 'splinefit6', 'png');"
make[2]: *** [Makefile:31328: doc/interpreter/gplot.png] Error 1
make[2]: *** Waiting for unfinished jobs....
make[2]: Leaving directory '/home/dima/src/octave/gcc_debug'
make[1]: *** [Makefile:27418: all-recursive] Error 1
make[1]: Leaving directory '/home/dima/src/octave/gcc_debug'
make: *** [Makefile:11053: all] Error 2

The actual failing file differs from run to run; e.g., it could be:
/bin/sh run-octave --norc --silent --no-history --path /home/dima/src/octave/gcc_debug/../doc/interpreter/ --eval "geometryimages ('doc/interpreter/', 'inpolygon', 'png');"
error: imwrite: invalid empty image
error: called from
    __imwrite__ at line 40 column 5
    imwrite at line 125 column 5
    print at line 755 column 13
    geometryimages at line 70 column 5
/bin/sh run-octave --norc --silent --no-history --path /home/dima/src/octave/gcc_debug/../doc/interpreter/ --eval "interpimages ('doc/interpreter/', 'interpft', 'png');"
/bin/sh run-octave --norc --silent --no-history --path /home/dima/src/octave/gcc_debug/../doc/interpreter/ --eval "interpimages ('doc/interpreter/', 'interpn', 'png');"
/bin/sh run-octave --norc --silent --no-history --path /home/dima/src/octave/gcc_debug/../doc/interpreter/ --eval "interpimages ('doc/interpreter/', 'interpderiv1', 'png');"
make[2]: *** [Makefile:31198: doc/interpreter/triplot.png] Error 1
make[2]: *** Waiting for unfinished jobs....
make[2]: Leaving directory '/home/dima/src/octave/gcc_debug'
make[1]: *** [Makefile:27418: all-recursive] Error 1
make[1]: Leaving directory '/home/dima/src/octave/gcc_debug'
make: *** [Makefile:11053: all] Error 2

 

Dmitri.

Re: segfaults building documentation when machine under load

Andreas Weber-6
In reply to this post by John W. Eaton
Am 19.05.20 um 14:44 schrieb John W. Eaton:
> Could you also try
>   (gdb) thread apply all bt

https://bpa.st/WBBA

-- Andy


Re: segfaults building documentation when machine under load

mmuetzel
Am 19. Mai 2020 um 16:43 Uhr schrieb "Andreas Weber":
> Am 19.05.20 um 14:44 schrieb John W. Eaton:
> > Could you also try
> >   (gdb) thread apply all bt
>
> https://bpa.st/WBBA

Looking at the backtrace, bug #55908 comes to mind:
https://savannah.gnu.org/bugs/index.php?55908

Cleaning up the graphics tree is quite complicated. And honestly, I would be happy if some fresh eyes could revisit the changes that were made for this bug.
I have a feeling that something isn't right with it. But I can't put the finger on what exactly is wrong...

Markus


Re: segfaults building documentation when machine under load

Dmitri A. Sergatskov


On Tue, May 19, 2020 at 12:44 PM "Markus Mützel" <[hidden email]> wrote:
Am 19. Mai 2020 um 16:43 Uhr schrieb "Andreas Weber":
> Am 19.05.20 um 14:44 schrieb John W. Eaton:
> > Could you also try
> >   (gdb) thread apply all bt
>
> https://bpa.st/WBBA

Looking at the backtrace, bug #55908 comes to mind:
https://savannah.gnu.org/bugs/index.php?55908

Cleaning up the graphics tree is quite complicated. And honestly, I would be happy if some fresh eyes could revisit the changes that were made for this bug.
I have a feeling that something isn't right with it. But I can't put the finger on what exactly is wrong...

Markus

Should we switch to the bug tracker?
I was able to get a crash when I bumped the jobs to 200.
bt is attached. The relevant part seems to be:

Thread 6 (Thread 0x7fb3754d5700 (LWP 31096)):
#0  0x00007fb3ad162774 in std::stack<octave::action_container::elem*, std::deque<octave::action_container::elem*, std::allocator<octave::action_container::elem*> > >::pop()
    (this=0x7fb3754d0eb8) at /usr/include/c++/8/bits/stl_stack.h:258
#1  0x00007fb3ad162172 in octave::unwind_protect::run_first() (this=0x7fb3754d0f60) at ../liboctave/util/unwind-prot.h:68
#2  0x00007fb3a99f5c68 in octave::action_container::run(unsigned long) (this=0x7fb3754d0f60, num=2) at ../liboctave/util/action-container.cc:38
#3  0x00007fb3ad161e8f in octave::action_container::run() (this=0x7fb3754d0f60) at ../liboctave/util/action-container.h:198
#4  0x00007fb3ad1620d4 in octave::unwind_protect::~unwind_protect() (this=0x7fb3754d0f60, __in_chrg=<optimized out>) at ../liboctave/util/unwind-prot.h:58
#5  0x00007fb3ac59a5b9 in base_graphics_object::remove_all_listeners() (this=0x7fb360488de0) at ../libinterp/corefcn/graphics.cc:3738
#6  0x00007fb3ac6e972c in graphics_object::remove_all_listeners() (this=0x7fb3603b24d8) at libinterp/corefcn/graphics.h:3126
#7  0x00007fb3ac595ad1 in gh_manager::free(octave_handle const&, bool) (this=0x7fb360104e40, h=..., from_root=true) at ../libinterp/corefcn/graphics.cc:2875
#8  0x00007fb3ac58e1cf in children_property::do_delete_children(bool, bool) (this=0x7fb36047b988, clear=true, from_root=true) at ../libinterp/corefcn/graphics.cc:1860
#9  0x00007fb3ac6e628c in children_property::delete_children(bool, bool) (this=0x7fb36047b988, clear=true, from_root=true) at libinterp/corefcn/graphics.h:1775
#10 0x00007fb3ac6e72eb in base_properties::delete_children(bool, bool) (this=0x7fb36047b700, clear=true, from_root=true) at libinterp/corefcn/graphics.h:2339
#11 0x00007fb3ac595af6 in gh_manager::free(octave_handle const&, bool) (this=0x7fb360104e40, h=..., from_root=true) at ../libinterp/corefcn/graphics.cc:2877

 hg id d9551fd70fc6 tip @

Dmitri.



Re: segfaults building documentation when machine under load

Dmitri A. Sergatskov
(now it is attached for real)

[attachment: das_bt_20200519.txt.gz (4K)]

Re: segfaults building documentation when machine under load

John W. Eaton
On 5/19/20 3:26 PM, Dmitri A. Sergatskov wrote:

>     Should we switch to bug-tracker?
>     I was able to get a crash when I bumped the jobs to 200.
>     bt is attached. The relevant part seems to be:

If I use a large number of jobs, I see

   error: imwrite: invalid empty image
   error: called from
       __imwrite__ at line 40 column 5
       imwrite at line 125 column 5
       print at line 755 column 13
       interpimages at line 72 column 5

but no segfaults.

It does look like a threading issue.

jwe


Re: segfaults building documentation when machine under load

Dmitri A. Sergatskov


On Tue, May 19, 2020 at 4:02 PM John W. Eaton <[hidden email]> wrote:
On 5/19/20 3:26 PM, Dmitri A. Sergatskov wrote:

>     Should we switch to bug-tracker?
>     I was able to get a crash when I bumped the jobs to 200.
>     bt is attached. The relevant part seems to be:

If I use a large number of jobs, I see

   error: imwrite: invalid empty image
   error: called from
       __imwrite__ at line 40 column 5
       imwrite at line 125 column 5
       print at line 755 column 13
       interpimages at line 72 column 5

but no segfaults.

It does look like a threading issue.

I used a simplified test by Andreas:

parallel -N0 -q octave --norc --silent --no-history --eval 'figure (1, "visible", "off");' ::: {1..200}
 


Dmitri.

Re: segfaults building documentation when machine under load

Dmitri A. Sergatskov

I compiled with thread sanitizer. It gives a lot of warnings just for starting/stopping octave. If I run
TSAN_OPTIONS=second_deadlock_stack=1 ./run-octave --norc --silent --no-history --eval 'figure (1, "visible", "off");' 2>tsan_plot.txt

A warning that stands out is:
WARNING: ThreadSanitizer: data race (pid=24895)
  Atomic write of size 8 at 0x7b0800000e80 by main thread (mutexes: write M103436625102881360):
    #0 __tsan_atomic64_fetch_add <null> (libtsan.so.0+0x6987d)
    #1 octave_atomic_increment ../liboctave/util/oct-atomic.c:41 (liboctave.so.8+0xd66884)
    #2 dim_vector::increment_count() ../liboctave/array/dim-vector.h:104 (liboctgui.so.6+0x20da84)
    #3 dim_vector::dim_vector() ../liboctave/array/dim-vector.h:271 (liboctgui.so.6+0x20dbe3)
    #4 Array<double>::Array() ../liboctave/array/Array.h:257 (liboctgui.so.6+0x20fe59)
    #5 MArray<double>::MArray() ../liboctave/array/MArray.h:72 (liboctgui.so.6+0x20f640)
    #6 NDArray::NDArray() <null> (liboctgui.so.6+0x20e142)
    #7 Matrix::Matrix() ../liboctave/array/dMatrix.h:62 (liboctgui.so.6+0x20e31e)
    #8 QtHandles::Figure::Figure(octave::base_qobject&, octave::interpreter&, graphics_object const&, QtHandles::FigureWindow*) ../libgui/graphics/Figure.cc:137 (liboctgui.so.6+0x2305dd)
    #9 QtHandles::Figure::create(octave::base_qobject&, octave::interpreter&, graphics_object const&) ../libgui/graphics/Figure.cc:116 (liboctgui.so.6+0x230155)
    #10 QtHandles::qt_graphics_toolkit::create_object(double) ../libgui/graphics/qt-graphics-toolkit.cc:405 (liboctgui.so.6+0x28ee58)
    #11 QtHandles::qt_graphics_toolkit::qt_static_metacall(QObject*, QMetaObject::Call, int, void**) libgui/graphics/moc-qt-graphics-toolkit.cc:122 (liboctgui.so.6+0x29a2c4)
    #12 QObject::event(QEvent*) <null> (libQt5Core.so.5+0x27c935)
    #13 QCoreApplication::notifyInternal2(QObject*, QEvent*) <null> (libQt5Core.so.5+0x252ec5)
    #14 octave::qt_application::execute() ../libgui/src/qt-application.cc:73 (liboctgui.so.6+0x366cb1)
    #15 main ../src/main-gui.cc:106 (lt-octave-gui+0x401c32)

  Previous read of size 8 at 0x7b0800000e80 by thread T6:
    #0 octave_atomic_increment ../liboctave/util/oct-atomic.c:43 (liboctave.so.8+0xd66890)
    #1 dim_vector::increment_count() ../liboctave/array/dim-vector.h:104 (liboctgui.so.6+0x20da84)
    #2 dim_vector::dim_vector() ../liboctave/array/dim-vector.h:271 (liboctgui.so.6+0x20dbe3)

<...>

The full tsan_plot.txt is attached.

Dmitri.



[attachment: tsan_plot.txt.gz (6K)]

Re: segfaults building documentation when machine under load

John W. Eaton
In reply to this post by Dmitri A. Sergatskov
On 5/19/20 4:11 PM, Dmitri A. Sergatskov wrote:

>
>
> On Tue, May 19, 2020 at 4:02 PM John W. Eaton <[hidden email]
> <mailto:[hidden email]>> wrote:
>
>     On 5/19/20 3:26 PM, Dmitri A. Sergatskov wrote:
>
>      >     Should we switch to bug-tracker?
>      >     I was able to get a crash when I bumped the jobs to 200.
>      >     bt is attached. The relevant part seems to be:
>
>     If I use a large number of jobs, I see
>
>         error: imwrite: invalid empty image
>         error: called from
>             __imwrite__ at line 40 column 5
>             imwrite at line 125 column 5
>             print at line 755 column 13
>             interpimages at line 72 column 5
>
>     but no segfaults.
>
>     It does look like a threading issue.
>
>
> I used a simplified test by Andreas:
>
> parallel -N0 -q octave --norc --silent --no-history --eval 'figure
> (1,"visible", "off");' ::: {1..200}

OK, I'm able to duplicate the problem using this method.

In the stack traces I've seen, Octave is crashing inside the interpreter
object destructor while attempting to close any remaining figures.

Changing the eval above to be

   figure (1, "visible", "off"); close ("all"); pause (1);

eliminates the crash for me, apparently because then there are no figure
windows to close when exiting.  But attempting the same thing in the
doc/interpreter scripts that generate plots I see the "invalid empty
image" error on every attempt to create a figure, at least when using a
large number of parallel Make jobs.

So clearly this kind of change is not a solution, but it may point us
toward one.  Ultimately, we need to determine the correct sequence for
shutting down the GUI and interpreter, including what actions can happen
or need to be blocked, and what signals need to be sent or locks
acquired so that there are no more races between the threads.

It's a bit tricky because figures can have callbacks set to run when the
figure is closed.  Do we expect those to run when Octave is in the
process of exiting or is it OK to skip them?  Those functions could
register code to run when Octave exits.  Should that be possible?  Would
it be OK for an atexit function to display a graph?  What is reasonable
to expect or attempt to do?

jwe

