# Shuffling elements in a dataset (fwd)

2 messages
Open this post in threaded view
|

## Shuffling elements in a dataset (fwd)

 ( Re Message From: [hidden email] ) > > I have the need to randomize the order (shuffle) of very large > datasets. The way I devise, randonly sampling with elimination, is > not very efficient. Is there a better way, using octave's matrix > manipulation? > > My way: > >     nm = num = rows(data); > >     for i=1:num >         rn = ceil(rand * nm--); >         new_data(i,:) = data(rn,:); >         data(rn,:) = []; >     endfor > > Better way: perhaps creating a vector of unique indexes? but how to > do this? > >     idx = 1:rows(data); >     now shuffle idx >     new_data = data(idx,:) > > Of course, this it is the same problem in one dimension... I find that something like    [dummy,ix] = sort(rand(1,rows(x)));  new_x = x(ix,:); seems pretty fast. (0.04 secs for 10000 rows, 0.05 secs for 100000 rows, or for 1000000, on a 386-DX/25MHz; 0.003 secs for 10000 rows, 0.004 secs for 100000 rows, or for 1000000, on Pentium-120, i.e. almost independent of number of rows. However for 10000000 rows it starts swapping and takes a while (48 MB RAM)). Above timings for 1 column only; reduce sizes pro rata for extra columns (RAM limit). Ted.                                    ([hidden email])
Open this post in threaded view
|

## Shuffling elements in a dataset (fwd)

 On 21-Feb-1997, Ted Harding <[hidden email]> wrote: | I find that something like | |    [dummy,ix] = sort(rand(1,rows(x)));  new_x = x(ix,:); | | seems pretty fast. (0.04 secs for 10000 rows, 0.05 secs for 100000 rows, | or for 1000000, on a 386-DX/25MHz; 0.003 secs for 10000 rows, 0.004 secs | for 100000 rows, or for 1000000, on Pentium-120, i.e. almost independent | of number of rows. However for 10000000 rows it starts swapping and takes | a while (48 MB RAM)). Above timings for 1 column only; reduce sizes pro | rata for extra columns (RAM limit). Hmm.  Here is what I get (counting down to minimize memory problems):   for n = [[8, 6, 4, 2]*1e5, 10.^[5, 4, 3, 2, 1]]     x = rand (n, 1);     t = cputime ();     [dummy,ix] = sort(rand(1,rows(x)));     new_x = x(ix,:);     t = cputime () - t;     printf ("N = %8d, t = %6.2f seconds, n*log(n) = %.2e\n", n, t, n*log(n));     fflush (stdout);   endfor   N =   500000, t =  13.96 seconds   N =   100000, t =   2.30 seconds   N =    10000, t =   0.17 seconds   N =     1000, t =   0.02 seconds   N =      100, t =   0.01 seconds   N =       10, t =   0.01 seconds at N == 10^6, I get into paging trouble on my 64MB Pentium system. How did you end up with almost constant time?  I took Octave's sort algorithm from Knuth.  It's good, but I don't think it's *that* good. jwe