Re: CPU usage by call of C++ code through system() on Linux

siko1056
On 6/26/20 3:58 PM, Andreas Stahel wrote:

> On 6/26/20 6:28 AM, Kai Torben Ohlhus wrote:
>> On 6/26/20 1:17 AM, Andreas Stahel wrote:
>>> Dear Octave Users
>>>
>>> Maybe one of you can give me a hint on how to make my Octave code run
>>> faster.
>>> Within a good size program (run time 40 sec) the command system() is
>>> used to call a C++ code.
>>> The C++ code uses pthreads.
>>> While the code is running htop shows approximately 40% of load by the
>>> kernel on each CPU and 60% "normal" (user space?).
>>>
>>> When running the same code in Matlab only the "normal" load shows and
>>> very little kernel load on the CPUs.
>>> The computation time by Matlab is also only 60% of the time consumed by
>>> Octave (5.2.0).
>>> The system is Ubuntu 20.04 on an AMD Ryzen 3950X.
>>>
>>> Any hints on what is slowing Octave down?
>>>
>>> With best regards
>>>
>>> Andreas
>>
>>
>> Dear Andreas,
>>
>> Maybe I do not understand your setup correctly.  You have a C++ code
>> using threads compiled to, e.g. "code.exe" (the suffix does not matter),
>> and an Octave script "benchmark.m" with somewhere the code line
>>
>>     system ("code.exe")
>>
>> First question is, do "benchmark.m" and "code.exe" interact with each
>> other?  That is, does "code.exe" compute something that "benchmark.m"
>> processes further by importing results?  What is the purpose of Octave
>> calling "code.exe"?  Benchmarking with tic-toc?
>>
>> Second question, does "code.exe" (standalone, without Octave or Matlab)
>> or "benchmark.m" (called from Octave or Matlab) have a run time of 40
>> seconds?
>>
>> Now to your observation.  When running "benchmark.m" in Octave and
>> Matlab you observe Octave is slower.  I do not understand how this is
>> related to the CPU "kernel" and "normal" usage?  What is the runtime of
>> "benchmark.m" in Matlab and Octave, respectively?  Is your complaint that
>> not all CPU cores are used?
>>
>> Maybe it is best to give us (some) code to better understand the
>> situation.
>>
>> Kai
>>
> Dear Kai
>
> Thank you for the quick reply and attempt to locate the problem.
> The code in "benchmark.m" is a loop with 600 iterations.
> In each iteration a C++ code is called through system().
> The C++ code is heavily threaded and uses FFTW extensively. FFTW is
> used as a single-threaded library.
> The multithreading is "hand coded".
> I have two options set up
>  NumIter = 0, no   FFT computations
>  NumIter = 2, many FFT computations
> In addition I called the binary with a loop in bash.
> These are the observed wall times, averaged for one call of the binary
>
> – Octave NumIter=2 : 59.6 ms, NumIter=0 : 16.3 ms,
> – MATLAB NumIter=2 : 38.3 ms, NumIter=0 : 20.1 ms,
> – bash   NumIter=2 : 37.9 ms, NumIter=0 : 19.2 ms,
>
> This puzzles me thoroughly!
>
> Andreas
>
> PS. on nabble these messages show up in the wrong thread!


Dear Andreas,

The maintainers list was not in the CC.  Sorry for the late reply.

I am still not convinced that I understand your setup and the
purpose of your computation.

Is there any output or synchronization between "code.exe" and
"benchmark.m"?  The Octave interpreter already spends "lots of time"
interpreting a for-loop alone, compared to your fast overall
computation time.

   a = 0; tic; for i = 1:600, a = a + i; end; toc

   Octave 1.53995 ms.
   Matlab 0.025   ms.

So maybe you just measure "slow" code interpretation when the body of
the for-loop is "heavier" than the one shown above?

Do you measure your wall time inside "code.exe" or in "benchmark.m" by
tic-toc, like in my example?  Maybe you will find no difference if you use
a more precise C/C++ library to measure the wall time and return it for
further processing by Octave or Matlab?

Kai

Re: CPU usage by call of C++ code through system() on Linux

Andreas Stahel-6


On 29.06.20 09:49, Kai Torben Ohlhus wrote:

> Dear Andreas,
>
> The maintainers list was not in the CC.  Sorry for the late reply.
>
> I am still not convinced that I understand your setup and the
> purpose of your computation.
>
> Is there any output or synchronization between "code.exe" and
> "benchmark.m"?  The Octave interpreter already spends "lots of time"
> interpreting a for-loop alone, compared to your fast overall
> computation time.
>
>     a = 0; tic; for i = 1:600, a = a + i; end; toc
>
>     Octave 1.53995 ms.
>     Matlab 0.025   ms.
>
> So maybe you just measure "slow" code interpretation when the body of
> the for-loop is "heavier" than the one shown above?
>
> Do you measure your wall time inside "code.exe" or in "benchmark.m" by
> tic-toc, like in my example?  Maybe you will find no difference if you use
> a more precise C/C++ library to measure the wall time and return it for
> further processing by Octave or Matlab?
>
> Kai

Dear Kai

Thank you for your effort.
Here is an attempt to clear up the situation.
The loop runs over 600 frames; the timings are given as averages per frame.

In the code "benchmark.m" the time per frame is measured by a
tic()/toc() pair:

    tic();
    system(command);   % this is where the computations are performed
    systemtime = toc();
    % print the per-frame time to get an impression while it is running
    display(sprintf('time = %f', systemtime))
    systemtimetotal = systemtimetotal + systemtime;

Based on your suggestion I added two system calls to gettimeofday() in
the C code.
The observed timing is consistent with the tic()/toc() result, i.e. the
tic()/toc() value is slightly higher.
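
For reference, a minimal sketch of what such in-program timing might look like. This is not the actual C code: the names elapsed_ms(), do_work() and report_timing() are hypothetical, and do_work() merely stands in for the threaded FFT computation.

```c
#include <stdio.h>
#include <sys/time.h>

/* Elapsed wall time between two gettimeofday() samples, in milliseconds. */
double elapsed_ms(struct timeval start, struct timeval end)
{
    return (end.tv_sec - start.tv_sec) * 1000.0
         + (end.tv_usec - start.tv_usec) / 1000.0;
}

/* Hypothetical stand-in for the real computation (assumption). */
double do_work(void)
{
    double sum = 0.0;
    for (int i = 1; i <= 1000000; i++)
        sum += 1.0 / i;
    return sum;
}

/* Time one call of do_work(), print the result, return the time in ms. */
double report_timing(void)
{
    struct timeval t0, t1;

    gettimeofday(&t0, NULL);
    double result = do_work();
    gettimeofday(&t1, NULL);

    double ms = elapsed_ms(t0, t1);
    printf("result = %f, wall time = %.3f ms\n", result, ms);
    return ms;
}
```

Calling report_timing() from main() prints one line per run. Note that gettimeofday() follows the system clock, so clock_gettime() with CLOCK_MONOTONIC may be a more robust choice for interval measurements.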

The C code was compiled with

    gcc -O3 -Wall RunMultipleTH_z_Neumann2.c -lpthread -lm -lfftw3 -o RunMultipleTH_z_Neumann2

"benchmark.m" and "code.exe" exchange some information through files.
I timed those file reads and writes; they take very little time.

On a host with a Ryzen 3950X CPU:
* running "code.exe" in a bash loop leads to 33 ms per frame;
   htop has almost all of the CPU load assigned to the user
* running the code in Octave leads to 59 ms per frame;
   htop has a sizable part of the CPU load assigned to the kernel
* running the code in Matlab leads to 37 ms per frame;
   htop has almost all of the CPU load assigned to the user
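
The user/kernel split that htop displays could also be put into numbers. The following is only a sketch under assumptions, not part of the setup described above: a small C harness that calls a command repeatedly through system() and reports the average wall time per call together with the user and kernel CPU time charged to the child processes, using getrusage() with RUSAGE_CHILDREN. The names tv_diff() and time_command() are hypothetical, and frames is assumed to be positive.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <sys/time.h>

/* Difference end - start of two timeval samples, in seconds. */
double tv_diff(struct timeval end, struct timeval start)
{
    return (end.tv_sec - start.tv_sec)
         + (end.tv_usec - start.tv_usec) / 1e6;
}

/*
 * Run `command` through system() `frames` times (frames > 0 assumed).
 * Prints the average wall time per call and the total user and kernel
 * CPU time of the waited-for children.
 * Returns 0 on success, -1 if system() itself fails.
 */
int time_command(const char *command, int frames)
{
    struct timeval t0, t1;
    struct rusage before, after;

    getrusage(RUSAGE_CHILDREN, &before);
    gettimeofday(&t0, NULL);
    for (int i = 0; i < frames; i++) {
        if (system(command) == -1)
            return -1;
    }
    gettimeofday(&t1, NULL);
    getrusage(RUSAGE_CHILDREN, &after);

    double wall = tv_diff(t1, t0);
    double user = tv_diff(after.ru_utime, before.ru_utime);
    double sys  = tv_diff(after.ru_stime, before.ru_stime);

    printf("wall per frame  : %.3f ms\n", 1000.0 * wall / frames);
    printf("user CPU total  : %.3f s\n", user);
    printf("kernel CPU total: %.3f s\n", sys);
    return 0;
}
```

A call such as time_command("./RunMultipleTH_z_Neumann2", 600) would mimic the bash loop; the binary's actual command-line arguments are not known here and would have to be added. This only measures the child processes and does not by itself explain why the split differs under Octave.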

If I reduce the FFTW computations within "code.exe", Octave is faster than
bash or Matlab, but by very little. The multiple threads are still
launched within the C code, but no 2D FFT operations are applied.


On a host with an Intel Xeon E5-1650 CPU a similar effect occurs, though
not quite as drastic:

    bash     80 ms
    Matlab   99 ms
    Octave  127 ms

I have no idea what could cause this surprising effect.

Enjoy the day

Andreas
