Octave 64-bit indexing built with ATLAS


Felix Willenborg
Dear all,

I was trying to compile Octave with 64-bit indexing as described in [3] (see references below). It worked quite well with OpenBLAS, following a Makefile created by Siko1056 [1]. Nevertheless, I wanted to try compiling it with ATLAS and 64-bit indexing in order to achieve higher computation speed and to contribute instructions to [3].

First of all, I wouldn't call myself an expert on this topic. From reading some posts on the internet I learned (please correct me if I'm wrong) that ATLAS should be faster in its numeric operations than OpenBLAS/Netlib because of several optimizations and native parallel threading (?).

I got a working version of ATLAS with 64-bit indexing with the following steps, guided by [2], which contains information on how to build ATLAS in general. I worked with the Netlib LAPACK 3.7.1 and ATLAS 3.10.3 source code. In /cluster/src/programs/octave/dep/ATLAS-3.10.3 I created a folder 'build', in which I executed:
../configure -b 64 --shared --prefix=/cluster/libraries/ATLAS/3.10.3 --with-netlib-lapack-tarfile=/cluster/src/_DOWNLOAD/lapack-3.7.1.tgz
After that I modified the Make.inc file to get the desired conversion from 32-bit to 64-bit integers (I used -finteger-4-integer-8 instead of -fdefault-integer-8 because I wanted to be careful). F2CDEFS had to be modified as well, because otherwise the check routines crashed (understandably).

/cluster/src/programs/octave/dep/ATLAS-3.10.3/build/Make.inc:
[...]
F2CDEFS = -DAdd_ -DF77_INTEGER=int64_t -DStringSunStyle
[...]
F77FLAGS = -O -mavx -m64 -fPIC -finteger-4-integer-8
F77NOOPT = -O0 -finteger-4-integer-8
[...]
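The Make.inc changes could also be scripted. The following is only a sketch under assumptions: it presumes that F77FLAGS and F77NOOPT each sit on a single line of the form `NAME = value`, which may not hold for every ATLAS version, and it runs on a made-up scratch file rather than the real Make.inc. The helper name add_int8_flag is hypothetical. F2CDEFS is left out because its original value is not shown here.

```shell
# Sketch: append the 64-bit integer flag to the Fortran flag lines of a
# Make.inc-style file and print the result to stdout (input is not modified).
int8_flag=-finteger-4-integer-8

add_int8_flag() {
    sed -e "/^[[:space:]]*F77FLAGS[[:space:]]*=/ s/\$/ $int8_flag/" \
        -e "/^[[:space:]]*F77NOOPT[[:space:]]*=/ s/\$/ $int8_flag/" "$1"
}

# Demonstration on a scratch file; its contents are invented for illustration.
demo_file=$(mktemp)
cat > "$demo_file" <<'EOF'
F77FLAGS = -O -mavx -m64 -fPIC
F77NOOPT = -O0
CC = gcc
EOF
add_int8_flag "$demo_file"
rm -f "$demo_file"
```

Redirecting the output to a new file and comparing it with the original (for example with diff) before replacing Make.inc keeps the change easy to review.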
With these modifications the build is ready to go. It is followed by some check routines, which all passed without errors.
make -j
make check
make ptcheck
make time
make install

To avoid conflicts, which may arise for whatever reason, I created some symbolic links with a suffix, which will be used later.
for f in <installation root>/lib/*.so; do ln -s "$f" "${f%.so}_Oct64.so"; done
for f in <installation root>/lib/*.a; do ln -s "$f" "${f%.a}_Oct64.a"; done

The installation location was also prepended to LD_LIBRARY_PATH and LIBRARY_PATH from now on:
LD_LIBRARY_PATH=/cluster/libraries/ATLAS/3.10.3/lib:$LD_LIBRARY_PATH
LIBRARY_PATH=/cluster/libraries/ATLAS/3.10.3/lib:$LIBRARY_PATH
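One caveat with plain assignments of this form: if the variable was empty or unset beforehand, the new value ends in a colon, and the glibc dynamic loader treats an empty LD_LIBRARY_PATH entry as the current directory. A minimal sketch that avoids the trailing colon; the helper name prepend_path is made up for illustration:

```shell
# prepend_path VAR DIR: put DIR in front of the colon-separated list in VAR,
# without leaving a trailing colon when VAR is empty or unset, then export VAR.
prepend_path() {
    eval "_old=\${$1:-}"          # current value of the named variable
    if [ -n "$_old" ]; then
        eval "$1=\$2:\$_old"
    else
        eval "$1=\$2"
    fi
    export "$1"
}

prepend_path LD_LIBRARY_PATH /cluster/libraries/ATLAS/3.10.3/lib
prepend_path LIBRARY_PATH /cluster/libraries/ATLAS/3.10.3/lib
```

For a single variable the same effect can be had inline with the `${VAR:+:$VAR}` parameter expansion.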

To compile qrupdate-1.1.2 I used the command from [3] and adapted it to my setup:
make -j test LAPACK=-ltatlas_Oct64 BLAS=-ltatlas_Oct64 FFLAGS='-O3 -fimplicit-none -finteger-4-integer-8'
Same goes for SuiteSparse-4.5.5...
make -j LAPACK=-ltatlas_Oct64 BLAS=-ltatlas_Oct64 UMFPACK_CONFIG=-D'LONGBLAS=long' CHOLMOD_CONFIG=-D'LONGBLAS=long'
make install LAPACK=-ltatlas_Oct64 BLAS=-ltatlas_Oct64 INSTALL=/cluster/programs/octave/4.2.1-atlas

and for arpack-ng (git checkout from 2017-07-01, YYYY-MM-DD):
./configure --prefix=/cluster/programs/octave/4.2.1 --localstatedir=/var --with-blas=-ltatlas_Oct64 --with-lapack=-ltatlas_Oct64 INTERFACE64=1
Finally, Octave could be compiled in a very light build to verify that it works. octave-4.2.1 was configured with
./configure --disable-readline --prefix=/cluster/programs/octave/4.2.1-atlas \
  --enable-64 --enable-static \
  --with-z-includedir=/cluster/libraries/zlib/1.2.8/include \
  --with-z-libdir=/cluster/libraries/zlib/1.2.8/lib \
  --with-hdf5-includedir=/cluster/libraries/HDF5/gcc/1.10.0-patch1/include \
  --with-hdf5-libdir=/cluster/libraries/HDF5/gcc/1.10.0-patch1/lib \
  --with-blas=-ltatlas_Oct64 --with-lapack=-ltatlas_Oct64 \
  --with-openssl=yes \
  --with-java-homedir=/cluster/programs/java/1.8.0-91 \
  --with-java-includedir=/cluster/programs/java/1.8.0-91/include \
  --with-java-libdir=/cluster/programs/java/1.8.0-91/jre/lib/amd64/server \
  F77_INTEGER_8_FLAG='-finteger-4-integer-8' \
  PKG_CONFIG_PATH='/cluster/programs/octave/4.2.1-atlas/lib/pkgconfig:/cluster/libraries/ATLAS/3.10.3/lib/pkgconfig' \
  LD_LIBRARY_PATH='/cluster/libraries/ATLAS/3.10.3/lib:/cluster/programs/octave/4.2.1-atlas/lib:$LD_LIBRARY_PATH' \
  LIBRARY_PATH='/cluster/libraries/ATLAS/3.10.3/lib:/cluster/programs/octave/4.2.1-atlas/lib:$LIBRARY_PATH' \
  CPPFLAGS='-I/cluster/programs/octave/4.2.1-atlas/include -I/cluster/libraries/ATLAS/3.10.3/include' \
  LDFLAGS='-L/cluster/programs/octave/4.2.1-atlas/lib -L/cluster/libraries/ATLAS/3.10.3/lib'
configure ran flawlessly, as did make. 'make check' passed all tests except one, a system test which crashed due to some permission problems (not worth mentioning). I wanted to check the speed of this Octave version with the following script, which performs some elementwise matrix multiplications. Over 400 iterations, the mean and the standard deviation of the MFLOP/s are calculated from the array mflops.

octave_mflops.m:
N = 400;
for i = 1:N
    n = 4096;
    x = rand(n, n);
    tic, x = x .* x;
    y = toc;
    mflops(i) = n*n / y / 1e6;
end

mflops_mean = sum(mflops)/N;
mflops_sig = std(mflops);
printf('MFLOPS: (%.2f +- %.2f)\n', mflops_mean, mflops_sig);
Now the funny part is the following, which confuses me a little. For the OpenBLAS build and the ATLAS build, I receive the following values:
MFLOPS: (370.47 +- 7.60) (OpenBLAS)
MFLOPS: (370.43 +- 7.27) (ATLAS)
I expected ATLAS to be faster than OpenBLAS. Also, when monitoring the load with 'htop', only one CPU is fully loaded. I expected ATLAS to use parallel threading, which I tried to ensure by using libtatlas_Oct64.so. Are my expectations wrong? If so, can someone explain to me what I did wrong?

Best wishes,
Felix Willenborg

References:
[1]: https://github.com/siko1056/GNU-Octave-enable-64/blob/dev/Makefile
[2]: http://mcs.une.edu.au/doc/atlas-devel/doc/atlas_install.pdf
[3]: https://www.gnu.org/software/octave/doc/interpreter/Compiling-Octave-with-64_002dbit-Indexing.html
-- 
Felix Willenborg

Arbeitsgruppe Machine Learning und Exzellenzcluster Hearing4all
Department für Medizinische Physik und Akustik
Fakultät für Medizin und Gesundheitswissenschaften 
Carl von Ossietzky Universität Oldenburg

Küpkersweg 74, 26129 Oldenburg
Tel: +49 441 798 3945

https://www.uni-oldenburg.de/machine-learning/

_______________________________________________
Help-octave mailing list
[hidden email]
https://lists.gnu.org/mailman/listinfo/help-octave
Re: Octave 64-bit indexing built with ATLAS

Dmitri A. Sergatskov


On Mon, Sep 11, 2017 at 1:39 PM, Felix Willenborg <[hidden email]> wrote:


> octave_mflops.m:
> N = 400;
> for i = 1:N
>     n = 4096;
>     x = rand(n, n);
>     tic, x = x .* x;
>     y = toc;
>     mflops(i) = n*n / y / 1e6;
> end
>
> mflops_mean = sum(mflops)/N;
> mflops_sig = std(mflops);
> printf('MFLOPS: (%.2f +- %.2f)\n', mflops_mean, mflops_sig);
> Now the funny part is the following, which confuses me a little. For the OpenBLAS build and the ATLAS build, I receive the following values:
> MFLOPS: (370.47 +- 7.60) (OpenBLAS)
> MFLOPS: (370.43 +- 7.27) (ATLAS)
> I expected ATLAS to be faster than OpenBLAS. Also, when monitoring the load with 'htop', only one CPU is fully loaded. I expected ATLAS to use parallel threading, which I tried to ensure by using libtatlas_Oct64.so. Are my expectations wrong? If so, can someone explain to me what I did wrong?


OpenBLAS is (generally) faster.
x .* x does not use BLAS/LAPACK (hence, ATLAS) code, which is why your result did not change.
Whatever number you calculate is a benchmark, but probably not actual MFLOPS.
Replacing x .* x with x * x (and setting N = 4), I get the following on an old 4-core computer:

With OpenBLAS:
octave:1> octave_mflops
MFLOPS: (5.19 +- 0.06)
With ATLAS:
octave:1> octave_mflops
MFLOPS: (3.56 +- 0.01)
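
To put such figures in perspective: a dense n-by-n matrix product needs roughly 2*n^3 floating-point operations, while the script divides only n*n by the elapsed time. Assuming the script was otherwise unchanged (n = 4096 and the same n*n formula, which the messages do not state explicitly), the printed value can be rescaled by a factor of 2*n. A small sketch using awk; the function name rescale_mflops is made up for illustration:

```shell
# rescale_mflops N REPORTED: convert a value computed as n*n/t/1e6 into
# GFLOP/s for a dense matrix product, which costs about 2*n^3 operations.
rescale_mflops() {
    awk -v n="$1" -v r="$2" 'BEGIN { printf "%.1f\n", r * 2 * n / 1000 }'
}

rescale_mflops 4096 5.19   # OpenBLAS figure above
rescale_mflops 4096 3.56   # ATLAS figure above
```

Under these assumptions the two calls print 42.5 and 29.2, that is, roughly 42.5 and 29.2 GFLOP/s.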

Regards,

Dmitri.


Re: Octave 64-bit indexing built with ATLAS

Felix Willenborg
Dear Dmitri,

thanks for your reply. Good to know. As I said, I wouldn't call myself an expert on this topic. When executing matrix multiplication instead of elementwise multiplication, I get the following outputs:
MFLOPS: (5.122373 +- 0.029796) (Reference LAPACK)
MFLOPS: (24.231716 +- 0.074426) (ATLAS)
MFLOPS: (29.056399 +- 0.131886) (OpenBLAS)
Now all cores are being used as well (except with reference LAPACK).

I have to make one correction to my compilation procedure, though. I did not compile ATLAS with 'make -j'; that somehow crashes all the time. It has to be compiled with a plain 'make'. Maybe someone wants to validate the whole procedure and try it out, so it can be added to https://www.gnu.org/software/octave/doc/interpreter/Compiling-Octave-with-64_002dbit-Indexing.html?

Best wishes,
Felix

On 12.09.2017 at 00:51, Dmitri A. Sergatskov wrote:

> OpenBLAS is (generally) faster.
> x .* x does not use BLAS/LAPACK (hence, ATLAS) code, which is why your result did not change.
> Whatever number you calculate is a benchmark, but probably not actual MFLOPS.
> Replacing x .* x with x * x (and setting N = 4), I get the following on an old 4-core computer:
>
> With OpenBLAS:
> octave:1> octave_mflops
> MFLOPS: (5.19 +- 0.06)
> With ATLAS:
> octave:1> octave_mflops
> MFLOPS: (3.56 +- 0.01)
>
> Regards,
>
> Dmitri.