Just for the sake of comparison, consider the code used in NumPy; the
convention there (which seems to be widespread) is to sample from [0,1).

This is also the convention used by C++. The documentation for the
uniform random number generator at

http://www.cplusplus.com/reference/random/uniform_real_distribution/

says "This distribution (also known as rectangular distribution)
produces random numbers in a range `[a,b)` where all intervals of the
same length within it are equally probable."
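
As a concrete illustration of that convention, here is a minimal,
self-contained example using the C++ <random> facility (the fixed seed
is arbitrary, chosen only for reproducibility):

    #include <iostream>
    #include <random>

    int main ()
    {
      std::mt19937 gen (42);  // arbitrary fixed seed
      std::uniform_real_distribution<float> dist (0.0f, 1.0f);  // samples [0,1); 1.0 excluded
      std::cout << dist (gen) << std::endl;
      return 0;
    }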

However, we have no freedom to follow this convention. The MathWorks
has decreed that the range shall not include the endpoints. See

https://www.mathworks.com/help/matlab/ref/rand.html

where it is stated, "`X = rand` returns a single uniformly distributed
random number in the interval (0,1)."

It is a good idea to compare against existing implementations. In this
case, 32 bits of random data are generated, the 9 low-order bits are
discarded by shifting right, leaving 23 bits of valid data, and the
result is then normalized by dividing by 2^23. I believe that this
unnecessarily discards one bit of information.
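
Based on that description (the NumPy source is not quoted here
verbatim), a minimal sketch of the conversion would look like this,
assuming the same randi32 () helper used later in this message returns
32 uniformly distributed random bits:

    static float randu32_numpy_style (void)
    {
      uint32_t i = randi32 () >> 9;  // discard the 9 low-order bits, keeping 23
      return i / 8388608.0f;         // normalize by 2^23, giving [0,1)
    }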

The IEEE single-precision format
(https://en.wikipedia.org/wiki/Single-precision_floating-point_format)
stores 23 bits of mantissa, but because normalized values carry an
implicit leading bit it effectively has 24 bits of mantissa. Hence, my
implementation used 24 bits.
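
For a quick sanity check of that count, std::numeric_limits<float>::digits
reports the effective significand width directly:

    #include <iostream>
    #include <limits>

    int main ()
    {
      // Prints 24: 23 stored mantissa bits plus the implicit leading bit.
      std::cout << std::numeric_limits<float>::digits << std::endl;
      return 0;
    }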

The other difference between my implementation and theirs is simply
the selection of random bits. Do you choose randi32[31:8] or
randi32[23:0]? All of the bits are random, so the choice comes down to
which strategy is more efficient. I used AND masking

i = randi32 () & static_cast<uint32_t> (0xFFFFFF);

to get the lowest 24 bits. For this sort of thing you really need to
know how your hardware is going to behave. There are dedicated souls
who have measured the number of clock cycles per instruction for
various architectures (see

www.agner.org/optimize/instruction_tables.pdf).

I chose, as reasonably representative of a modern computer, an Intel
Nehalem-architecture CPU. Both AND with immediate data and SHR
(SHift Right) are single clock-cycle instructions. However, because
of parallel ALUs, when considered in a pipeline, the AND instruction
averages 0.33 clock cycles while the SHR averages 0.5 clock cycles.
Thus there is a slight preference for the coding strategy I used, but
nothing particularly definitive.
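
For reference, here are the two equivalent selections side by side,
again assuming randi32 () supplies 32 random bits:

    uint32_t low24  = randi32 () & static_cast<uint32_t> (0xFFFFFF);  // bits [23:0] via AND
    uint32_t high24 = randi32 () >> 8;                                // bits [31:8] via SHR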

Dan mentioned that the algorithm doesn't normalize using division by
a constant, but rather by multiplying by a constant:

`return i * (1.0f / 16777216.0f);`

He also mentioned vagaries in floating point hardware units.
Both points are valid. Using the same data source as above, FMUL
(Floating point MULtiply) has a latency of 5 clock cycles, but can be
fully pipelined so that sequential multiplies can issue every clock
cycle. The FDIV (Floating point DIVision) instruction has a variable
latency of 7-27 cycles and can't be pipelined effectively. Thus,
multiplication is the preferred strategy, and any decent compiler will
fold the division into a single compile-time constant and then use
multiplication.
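
The two normalization forms, for comparison; the second is written so
that the reciprocal is a compile-time constant and the generated code
is a single multiply:

    float by_division = i / 16777216.0f;           // may emit FDIV at low optimization levels
    float by_multiply = i * (1.0f / 16777216.0f);  // reciprocal folded at compile time; FMUL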

Dan also mentioned directly copying bits from the integer to the
floating point value. This is possible to code, but the reason to
do this would be for performance rather than correctness. Right
now, I'm more worried about correctness, and I would lean away from
introducing further changes that could bring their own set of
issues. For instance, which bits to set will depend on whether the
processor is big-endian or little-endian, or even hermaphroditic
like ARM CPUs that can switch on the fly. That could be an
optimization for Octave 7, or we could just switch to using the C++
<random> libraries instead.
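
For the record, here is a hedged sketch of that bit-copying idea (this
is not the code being proposed, and the function name is illustrative):
force the sign and exponent fields to select the interval [1,2), drop
23 random bits into the mantissa, and subtract 1.

    #include <cstdint>
    #include <cstring>

    static float randu_bitcopy (uint32_t r)  // r: 32 random bits (assumption)
    {
      // sign = 0, biased exponent = 127, mantissa = 23 random bits
      uint32_t bits = UINT32_C (0x3F800000) | (r >> 9);
      float f;
      std::memcpy (&f, &bits, sizeof (f));  // well-defined, unlike a pointer cast
      return f - 1.0f;                      // maps [1.0, 2.0) onto [0.0, 1.0)
    }

Note that this variant yields only 23 random mantissa bits and can
return exactly 0.0f, so as written it meets neither the 24-bit goal nor
the (0,1) requirement; it is shown only to make the discussion concrete.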

Gene suggested a way of getting rid of the while loop which tests
for 0 with

` return ((randi32 () & uint32_t (0x00FFFFFE)) + 1.f)`
`          / std::exp2 (std::numeric_limits<float>::digits);`

The problem here is that this spreads values over only 2^23 floating
point values rather than the larger, but still possible, 2^24. To see
this, consider that the mask 0xFF_FF_FE has only 23 bits set and that
the addition of 1.f is equivalent to forcibly setting the
least-significant bit (i | 0x01). The values generated will always be
odd (1, 3, 5, ...) rather than also including the evens (1, 2, 3, 4,
...).
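
To spell that out in code:

    uint32_t i = randi32 () & uint32_t (0x00FFFFFE);  // least-significant bit is forced to 0
    // so (i + 1.f) is always an odd integer value: only 2^23 of the
    // 2^24 possible outcomes can ever occur.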

So, after the very useful input, I think this code should work:

` static float randu32 (void)`
` {`
`   uint32_t i;`
`   do`
`     {`
`       i = randi32 () & static_cast<uint32_t> (0xFFFFFF);`
`     }`
`   while (i == 0);`
`   return i * (1.0f / 16777216.0f);`
` }`

which is nearly the same as before, except that it removes the
unnecessary exclusion of i == 0xFFFFFF, which several people pointed
out.
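
As a final sanity check on the interval: the smallest value the
function can return is 1 * (1.0f / 16777216.0f) = 2^-24 and the
largest is 16777215 * (1.0f / 16777216.0f) = 1 - 2^-24, so every
result lies strictly inside (0,1), matching the MATLAB convention
quoted above.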

--Rik