random number generation, part 1


random number generation, part 1

Rik-4
On 12/17/2019 01:54 PM, [hidden email] wrote:

> Just for the sake of comparison, here is the code used in numpy; the
> convention (and it seems to be widespread) is to sample [0,1).

This is also the convention used by C++.  The documentation for the uniform random number generator at http://www.cplusplus.com/reference/random/uniform_real_distribution/ says "This distribution (also known as rectangular distribution) produces random numbers in a range [a,b) where all intervals of the same length within it are equally probable."

However, we have no freedom to follow this convention.  The Mathworks has decreed that the range shall not include the endpoints.  See https://www.mathworks.com/help/matlab/ref/rand.html where it is stated, "X = rand returns a single uniformly distributed random number in the interval (0,1)."

A good idea to compare to existing implementations.  In this case, 32 bits of random data are generated, the 9 low-order bits are discarded by shifting right, leaving 23 bits of valid data, which are then normalized by dividing by 2^23.  I believe that this unnecessarily discards one bit of information.

The IEEE single-precision format (https://en.wikipedia.org/wiki/Single-precision_floating-point_format) stores 23 bits of mantissa, but because of the implicit leading bit it effectively has 24 bits of mantissa.  Hence, my implementation used 24 bits.

The other difference between my implementation and theirs is simply the selection of random bits.  Do you choose randi32[31:8] or randi32[23:0]?  All of the bits are random so the choice becomes which strategy is more efficient.  I used AND masking
i = randi32 () & static_cast<uint32_t> (0xFFFFFF);
to get the lowest 24 bits.  For this sort of thing you really need to know how your hardware is going to behave.  There are dedicated souls who have measured the number of clock cycles per instruction for various architectures (see www.agner.org/optimize/instruction_tables.pdf).  I chose, as reasonably representative of a modern computer, an Intel Nehalem-architecture CPU.

Both AND with immediate data and SHR (SHift Right) are single clock-cycle instructions.  However, because of parallel ALUs, when considered in a pipeline, the AND instruction takes 0.33 clock cycles while the SHR takes 0.5 clock cycles.  Thus there is a slight preference for the coding strategy I used, but nothing particularly definitive.

Dan mentioned that the algorithm doesn't normalize using division by a constant, but rather multiplication by a constant.

return i * (1.0f / 16777216.0f);

He also mentioned vagaries in floating point hardware units.  Both points are valid.  Using the same data source as above, the FMUL (Floating point MULtiply) has a latency of 5 clock cycles, but can be fully pipelined so that sequential multiplies can be issued every clock cycle.  The FDIV (Floating point DIVision) instruction has a variable latency of 7-27 cycles and can't be pipelined effectively.  Thus, multiplication is the preferred strategy, and any decent compiler will fold the division into a single compile-time constant and then use multiplication.

Dan also mentioned directly copying bits from the integer to the floating point value.  This is possible to code, but the reason to do this would be for performance rather than correctness.  Right now, I'm more worried about correctness, and I would lean away from introducing further changes that could bring their own set of issues.  For instance, which bits to set will depend on whether the processor is big-endian or little-endian, or even hermaphroditic like ARM CPUs that can switch on the fly.  That could be an optimization for Octave 7, or we could just switch to using the C++ <random> libraries instead.

Gene suggested a way to get rid of the while loop that tests for 0:

  return ((randi32 () & uint32_t (0x00FFFFFE)) + 1.f)
           / std::exp2 (std::numeric_limits<float>::digits);

The problem here is that this spreads values over only 2^23 floating point values rather than the larger, but still achievable, 2^24.  To see this, consider that the mask 0xFF_FF_FE has only 23 set bits and that the addition of 1.f is equivalent to forcibly setting the least-significant bit (i | 0x01).  The numerators generated will always be odd (1, 3, 5, ...) rather than also including the evens (1, 2, 3, 4, ...).

So, after the very useful input, I think this code should work

 static float randu32 (void)
   {
     uint32_t i;

     do
       {
         i = randi32 () & static_cast<uint32_t> (0xFFFFFF);
       }
     while (i == 0);

     return i * (1.0f / 16777216.0f);
   }

which is nearly the same as before, except it removes the unnecessary exclusion of i == 0xFFFFFF which several people pointed out.

--Rik





Re: random number generation, part 1

Daniel Sebald
On 12/18/19 7:02 PM, Rik wrote:

> Dan also mentioned directly copying bits from the integer to the
> floating point value.  This is possible to code, but the reason to do
> this would be for performance rather than correctness.  Right now, I'm
> more worried about correctness, and I would lean away from introducing
> further changes that could bring their own set of issues.  For instance,
> which bits to set will depend on whether the processor is big-endian or
> little-endian, or even hermaphroditic like ARM CPUs that can switch on
> the fly.  That could be an optimization for Octave 7, or we could just
> switch to using the C++ <random> libraries instead.

Part of this actually was for correctness.  The broader issue is that
typically we aren't interested in some unit-normal type of RN, but
rather we scale the RN to achieve some proper SNR or whatever.  We'd
like to do the scaling before casting to a 24-bit float value.  Hence
something like

v = randi(mu, sigma)

rather than

v = sigma * randi() + mu

In this case your "* (1.0f / 16777216.0f)" will achieve a clean bit
shift with any modern CPU/ALU.  But if it weren't an exact power of two,
the bits of the mantissa would be quite altered.  I'm thinking then in
the LSB of the mantissa we'd start to see dithering patterns emerge
because the floating point arithmetic has to truncate back to 24 bits.

Double precision?  Might not matter too much in that case because of the
increased precision.  Part of it involves the application as well.  Rare
event types of problems really test the RNG.

Anyway, making progress.

Dan