Shuffling elements in a dataset (fwd)

classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

Shuffling elements in a dataset (fwd)

Ted.Harding
( Re Message From: [hidden email] )

>
> I have the need to randomize the order (shuffle) of very large
> datasets. The way I devise, randonly sampling with elimination, is
> not very efficient. Is there a better way, using octave's matrix
> manipulation?
>
> My way:
>
>     nm = num = rows(data);
>
>     for i=1:num
>         rn = ceil(rand * nm--);
>         new_data(i,:) = data(rn,:);
>         data(rn,:) = [];
>     endfor
>
> Better way: perhaps creating a vector of unique indexes? but how to
> do this?
>
>     idx = 1:rows(data);
>     now shuffle idx
>     new_data = data(idx,:)
>
> Of course, this it is the same problem in one dimension...

I find that something like

   [dummy,ix] = sort(rand(1,rows(x)));  new_x = x(ix,:);

seems pretty fast. (0.04 secs for 10000 rows, 0.05 secs for 100000 rows,
or for 1000000, on a 386-DX/25MHz; 0.003 secs for 10000 rows, 0.004 secs
for 100000 rows, or for 1000000, on Pentium-120, i.e. almost independent
of number of rows. However for 10000000 rows it starts swapping and takes
a while (48 MB RAM)). Above timings for 1 column only; reduce sizes pro
rata for extra columns (RAM limit).

Ted.                                    ([hidden email])

Reply | Threaded
Open this post in threaded view
|

Shuffling elements in a dataset (fwd)

John W. Eaton-6
On 21-Feb-1997, Ted Harding <[hidden email]> wrote:

| I find that something like
|
|    [dummy,ix] = sort(rand(1,rows(x)));  new_x = x(ix,:);
|
| seems pretty fast. (0.04 secs for 10000 rows, 0.05 secs for 100000 rows,
| or for 1000000, on a 386-DX/25MHz; 0.003 secs for 10000 rows, 0.004 secs
| for 100000 rows, or for 1000000, on Pentium-120, i.e. almost independent
| of number of rows. However for 10000000 rows it starts swapping and takes
| a while (48 MB RAM)). Above timings for 1 column only; reduce sizes pro
| rata for extra columns (RAM limit).

Hmm.  Here is what I get (counting down to minimize memory problems):

  for n = [[8, 6, 4, 2]*1e5, 10.^[5, 4, 3, 2, 1]]
    x = rand (n, 1);
    t = cputime ();
    [dummy,ix] = sort(rand(1,rows(x)));
    new_x = x(ix,:);
    t = cputime () - t;
    printf ("N = %8d, t = %6.2f seconds, n*log(n) = %.2e\n", n, t, n*log(n));
    fflush (stdout);
  endfor

  N =   500000, t =  13.96 seconds
  N =   100000, t =   2.30 seconds
  N =    10000, t =   0.17 seconds
  N =     1000, t =   0.02 seconds
  N =      100, t =   0.01 seconds
  N =       10, t =   0.01 seconds

at N == 10^6, I get into paging trouble on my 64MB Pentium system.

How did you end up with almost constant time?  I took Octave's sort
algorithm from Knuth.  It's good, but I don't think it's *that* good.

jwe