Slow code with many intermediate matrices

niconeuman
Greetings,
I have a program that calls many functions, each of which builds matrices
from other matrices by repeated element-wise multiplications and additions.
It's part of an attempt at an electronic structure program.
The code is very slow (although that is subjective), and I believe the cause
is the large number of intermediate arrays defined in each function. As
a short example, a small part of the code of one such function is:

px_s_dxx_s_0 = QCx.*px_s_px_s_0 + WQx.*px_s_px_s_1 + 1.*oo2q.*(px_s_s_s_0 - qoppq.*px_s_s_s_1);
px_s_dxy_s_0 = QCx.*px_s_py_s_0 + WQx.*px_s_py_s_1;
px_s_dxz_s_0 = QCx.*px_s_pz_s_0 + WQx.*px_s_pz_s_1;
px_s_dyy_s_0 = QCy.*px_s_py_s_0 + WQy.*px_s_py_s_1 + 1.*oo2q.*(px_s_s_s_0 - qoppq.*px_s_s_s_1);
px_s_dyz_s_0 = QCy.*px_s_pz_s_0 + WQy.*px_s_pz_s_1;
px_s_dzz_s_0 = QCz.*px_s_pz_s_0 + WQz.*px_s_pz_s_1 + 1.*oo2q.*(px_s_s_s_0 - qoppq.*px_s_s_s_1);
py_s_dxx_s_0 = QCx.*py_s_px_s_0 + WQx.*py_s_px_s_1 + 1.*oo2q.*(py_s_s_s_0 - qoppq.*py_s_s_s_1);
py_s_dxy_s_0 = QCx.*py_s_py_s_0 + WQx.*py_s_py_s_1;
py_s_dxz_s_0 = QCx.*py_s_pz_s_0 + WQx.*py_s_pz_s_1;
...
...
...
dxx_s_px_s_1 = QCx.*dxx_s_s_s_1 + WQx.*dxx_s_s_s_2 + 1.*oo2pq.*(px_s_s_s_2);
dxx_s_py_s_1 = QCy.*dxx_s_s_s_1 + WQy.*dxx_s_s_s_2;
dxx_s_pz_s_1 = QCz.*dxx_s_s_s_1 + WQz.*dxx_s_s_s_2;
dxy_s_px_s_1 = QCx.*dxy_s_s_s_1 + WQx.*dxy_s_s_s_2;
dxy_s_py_s_1 = QCy.*dxy_s_s_s_1 + WQy.*dxy_s_s_s_2;
dxy_s_pz_s_1 = QCz.*dxy_s_s_s_1 + WQz.*dxy_s_s_s_2;
...

All these are N x 1 arrays, with N in the tens or even low hundreds. There
are MANY such functions, and this is a very small part.
I think that I'm defining too many arrays such as dxx_s_px_s_0, etc, and
that if I do something like:
d_s_p_s_1 = zeros(ni,nj);
and then fill up the matrices like this:

d_s_p_s_1(some appropriate index) = QCx.*d_s_s_s_1(ind1) + WQx.*d_s_s_s_2(ind1) + 1.*oo2pq.*(p_s_s_s_2(ind2));
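
To make the idea concrete, here is a minimal sketch of what the generated
code could look like for the three dxx_s_p*_s_1 lines shown earlier. The
size N, the random stand-in data, the N x 3 arrays QC and WQ (holding the
x, y, z components as columns) and the column layout of d_s_p_s_1 are all
assumptions made up for illustration; my generator does not emit this today.

```octave
N = 100;                          % number of rows (assumed size)

% Hypothetical stand-in data; the real arrays come from earlier steps.
QC = rand(N, 3);                  % columns: QCx, QCy, QCz (assumed packing)
WQ = rand(N, 3);                  % columns: WQx, WQy, WQz (assumed packing)
oo2pq       = rand(N, 1);
dxx_s_s_s_1 = rand(N, 1);
dxx_s_s_s_2 = rand(N, 1);
px_s_s_s_2  = rand(N, 1);

% One preallocated block instead of three separately named arrays.
% Assumed layout: column 1 = dxx_s_px, column 2 = dxx_s_py, column 3 = dxx_s_pz.
d_s_p_s_1 = zeros(N, 3);

% Broadcasting (N x 3 times N x 1) fills all three columns in one statement.
d_s_p_s_1(:, :) = QC .* dxx_s_s_s_1 + WQ .* dxx_s_s_s_2;

% Only the px column carries the extra oo2pq term.
d_s_p_s_1(:, 1) += oo2pq .* px_s_s_s_2;
```

Whether this is actually faster than the named-array version is exactly
what I don't know; it only reduces the number of variables and statements.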

then I would use memory more efficiently. I'm also not sure whether each
of these sums creates further intermediate arrays. Another
approach would be:

d_s_p_s_1(some appropriate index) += QCx.*d_s_s_s_1(ind1);
d_s_p_s_1(some appropriate index) += WQx.*d_s_s_s_2(ind1);
d_s_p_s_1(some appropriate index) += 1.*oo2pq.*(p_s_s_s_2(ind2));
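
A rough way to compare the two forms before touching the generator would be
a small timing loop like the sketch below. The size N, the repetition count
and the random vectors a, b, c (standing in for the d_s_s_s_* and p_s_s_s_*
arrays) are assumptions for illustration, and I have not drawn any
conclusion from it yet.

```octave
N    = 100;                       % assumed array length
reps = 10000;                     % assumed number of repetitions

% Hypothetical stand-in data.
QCx = rand(N, 1);  WQx = rand(N, 1);  oo2pq = rand(N, 1);
a   = rand(N, 1);  b   = rand(N, 1);  c     = rand(N, 1);

% Variant 1: the whole sum in a single expression.
tic;
for k = 1:reps
  r1 = QCx.*a + WQx.*b + oo2pq.*c;
end
t_single = toc;

% Variant 2: accumulate term by term into a preallocated array.
r2 = zeros(N, 1);
tic;
for k = 1:reps
  r2(:) = 0;                      % reset the accumulator in place
  r2 += QCx.*a;
  r2 += WQx.*b;
  r2 += oo2pq.*c;
end
t_accum = toc;

printf("single expression: %.4f s\n", t_single);
printf("accumulated terms: %.4f s\n", t_accum);
```

Both variants compute the same result, so any difference in the timings
would come only from how the statements are evaluated.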

I'm looking for general advice on how to make this type of code, which
is already vectorized, more efficient. The code was generated automatically
by Python programs that I wrote. To change the code I need to
modify the generator, so I can try things, but producing a new output
syntax takes me a few days. That's why I'm asking before trying several
approaches and benchmarking them. I would also like to understand better
how Octave handles memory and other resources.
Thank you very much!
Nicolas