Ziggurat code

classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

Ziggurat code

Allin Cottrell
I'm a developer of the GPL'd econometrics program gretl, and a
co-worker and I were recently benchmarking our generation of random
normals. We saw that Octave was noticeably faster at this, and I
investigated. It transpired that the speed difference was down to
David Bateman's very nice implementation of the Marsaglia-Tsang
Ziggurat method.

We had our own implementation but David's was faster, without any
sacrifice of quality (good coverage of the real line, passes the
l'Ecuyer "Crush" test), so we have, gratefully and with attribution,
"stolen" David's version for gretl.

Maybe I can give a little something back. There appears to a slight
mix-up in the use of *X86_32 macros in the source file in question,
namely randmtzig.cc. Near the top of this file there appears the
following preprocessor stanza:

#if ! defined (USE_X86_32)
#  if defined (i386) || defined (HAVE_X86_32)
#    define USE_X86_32 1
#  else
#    define USE_X86_32 0
#  endif
#endif

Here the symbol HAVE_X86_32 is coming from "elsewhere" (I presume,
config.h or the compiler command-line), and the symbol USE_X86_32 is
set conditional on that plus the compiler symbol i386. One would
suppose, therefore, that the code below should branch conditional on
USE_X86_32, but that's not the case: USE_X86_32 is not referenced at
all; everything below depends on HAVE_X86_32.

It seems that either (a) the code below should branch on USE_X86_32
rather than HAVE_X86_32, or (b) if HAVE_X86_32 is reckoned to carry
all the relevant information, the aforementioned preprocessor code
should be removed, for clarity.

Related, there's a comment which suggests some uncertainty, "Check
whether -DUSE_X86_32=0 is faster on 64-bit x86 architectures." I can
say something about that. I've tested the two variants of the
Ziggurat code in randmtzig.cc on two 64-bit systems. In each case
the integer RNG is version 1.4.1 of SFMT, using SSE2; the only
difference is in the Ziggurat code.

The experiment involved generating 200 million random normals,
repeated 5 times. In the tables below the columns headed "A" and "B"
hold timings in seconds for the default 64-bit code-path and the
x86 32-bit optimized code-path respectively. The columns to the
right of A and B are elapsed-time indices relative to 100 = best on
platform.

Fedora 23 64-bit, core i7-2600 @3.40GHz, gcc 5.3.1

   A      rel     B      rel
3.3994 100.00  3.9320 115.67
3.4058 100.19  3.8608 113.57
3.4045 100.15  3.9076 114.95
3.5408 100.16  3.9141 115.14
3.4064 100.21  3.8545 113.39

Arch 64-bit, core i7-920 @2.67GHz, gcc 6.2.1

   A      rel     B      rel
3.9583 100.00  4.3021 108.69
3.9867 100.72  4.3653 110.28
3.9781 100.50  4.3308 109.41
4.0371 101.99  4.3953 111.04
3.9899 100.80  4.4652 112.81

There's not a huge difference between the Ziggurat variants, but
nonetheless it's clear, on this sample, that you don't want to use
the 32-bit-optimized code on a 64-bit system. On the more recent
i7-2600 that incurs a time penalty of about 15 percent; on the older
i7-920 the penalty is around 10 percent.

--
Allin Cottrell
Department of Economics
Wake Forest University

Reply | Threaded
Open this post in threaded view
|

Re: Ziggurat code

siko1056
Allin Cottrell wrote
Maybe I can give a little something back. There appears to a slight
mix-up in the use of *X86_32 macros in the source file in question,
namely randmtzig.cc. Near the top of this file there appears the
following preprocessor stanza:

#if ! defined (USE_X86_32)
#  if defined (i386) || defined (HAVE_X86_32)
#    define USE_X86_32 1
#  else
#    define USE_X86_32 0
#  endif
#endif

Here the symbol HAVE_X86_32 is coming from "elsewhere" (I presume,
config.h or the compiler command-line), and the symbol USE_X86_32 is
set conditional on that plus the compiler symbol i386. One would
suppose, therefore, that the code below should branch conditional on
USE_X86_32, but that's not the case: USE_X86_32 is not referenced at
all; everything below depends on HAVE_X86_32.

It seems that either (a) the code below should branch on USE_X86_32
rather than HAVE_X86_32, or (b) if HAVE_X86_32 is reckoned to carry
all the relevant information, the aforementioned preprocessor code
should be removed, for clarity.
Dear Allin Cottrell,

Thank you for your comment and effort! Regarding the preprocessor stanza, I vote for option (b) and opened a bug report to remember to review this: https://savannah.gnu.org/bugs/index.php?49305 

Best,
Kai
Reply | Threaded
Open this post in threaded view
|

Re: Ziggurat code

David Bateman-7
In reply to this post by Allin Cottrell
Thanks for the heads up of your use of the code and the compliment.

As for the USE_X86_32 define, I believe this is a hang over from Paul Kienzle's original Mersenne Twister code. Paul went out of his way to write his code in such a way as it could be built stand alone and this define was part of that.

The code seem to have organically developed since both when it has in Source Forge and after in the core of Octave. So as you say best just to get rid of the define block

D.

> Le 10 oct. 2016 à 01:52, Allin Cottrell <[hidden email]> a écrit :
>
> I'm a developer of the GPL'd econometrics program gretl, and a co-worker and I were recently benchmarking our generation of random normals. We saw that Octave was noticeably faster at this, and I investigated. It transpired that the speed difference was down to David Bateman's very nice implementation of the Marsaglia-Tsang Ziggurat method.
>
> We had our own implementation but David's was faster, without any sacrifice of quality (good coverage of the real line, passes the
> l'Ecuyer "Crush" test), so we have, gratefully and with attribution,
> "stolen" David's version for gretl.
>
> Maybe I can give a little something back. There appears to a slight mix-up in the use of *X86_32 macros in the source file in question, namely randmtzig.cc. Near the top of this file there appears the following preprocessor stanza:
>
> #if ! defined (USE_X86_32)
> #  if defined (i386) || defined (HAVE_X86_32)
> #    define USE_X86_32 1
> #  else
> #    define USE_X86_32 0
> #  endif
> #endif
>
> Here the symbol HAVE_X86_32 is coming from "elsewhere" (I presume, config.h or the compiler command-line), and the symbol USE_X86_32 is set conditional on that plus the compiler symbol i386. One would suppose, therefore, that the code below should branch conditional on USE_X86_32, but that's not the case: USE_X86_32 is not referenced at all; everything below depends on HAVE_X86_32.
>
> It seems that either (a) the code below should branch on USE_X86_32 rather than HAVE_X86_32, or (b) if HAVE_X86_32 is reckoned to carry all the relevant information, the aforementioned preprocessor code should be removed, for clarity.
>
> Related, there's a comment which suggests some uncertainty, "Check whether -DUSE_X86_32=0 is faster on 64-bit x86 architectures." I can say something about that. I've tested the two variants of the Ziggurat code in randmtzig.cc on two 64-bit systems. In each case the integer RNG is version 1.4.1 of SFMT, using SSE2; the only difference is in the Ziggurat code.
>
> The experiment involved generating 200 million random normals, repeated 5 times. In the tables below the columns headed "A" and "B"
> hold timings in seconds for the default 64-bit code-path and the x86 32-bit optimized code-path respectively. The columns to the right of A and B are elapsed-time indices relative to 100 = best on platform.
>
> Fedora 23 64-bit, core i7-2600 @3.40GHz, gcc 5.3.1
>
>  A      rel     B      rel
> 3.3994 100.00  3.9320 115.67
> 3.4058 100.19  3.8608 113.57
> 3.4045 100.15  3.9076 114.95
> 3.5408 100.16  3.9141 115.14
> 3.4064 100.21  3.8545 113.39
>
> Arch 64-bit, core i7-920 @2.67GHz, gcc 6.2.1
>
>  A      rel     B      rel
> 3.9583 100.00  4.3021 108.69
> 3.9867 100.72  4.3653 110.28
> 3.9781 100.50  4.3308 109.41
> 4.0371 101.99  4.3953 111.04
> 3.9899 100.80  4.4652 112.81
>
> There's not a huge difference between the Ziggurat variants, but nonetheless it's clear, on this sample, that you don't want to use the 32-bit-optimized code on a 64-bit system. On the more recent
> i7-2600 that incurs a time penalty of about 15 percent; on the older
> i7-920 the penalty is around 10 percent.
>
> --
> Allin Cottrell
> Department of Economics
> Wake Forest University
>