removing replicated rows from a (large) matrix

classic Classic list List threaded Threaded
6 messages Options
Reply | Threaded
Open this post in threaded view
|

removing replicated rows from a (large) matrix

Joao Cardoso-3

Hi,

I often need to remove repeated rows from a matrix. I use the following
script, which is the fastest I could think about (I have not checked
Knuth).

Is there a faster method?

Thanks,
Joao

function ix_del = unique(data)

# ix = unique(data)
#
# return the indexes of identical rows entries in `data',
# leaving out one repeated example.
#
# Executing `data(unique(data),:) = []', will leave `data' with unique
rows,

if (isempty(data))
        ix_del = [];
        return
endif

if (columns(data) == 1)
        data = [data data]; # the alg. fail if columns(data) == 1
endif

[t ix] = sort(data(:,1));
data = data(ix,:);                      # organize data as blocks

nr = rows(data);

ix_del = []; t = ones(nr,1);

for i=1:nr
    if (any(i==ix_del))         # already matched
        continue
    endif
   
    ixx = find(data(i,1) == data(i+1:nr,1));    # get a block
    if (isempty(ixx))
       continue
    endif
   
    ixxx = find(all((data(i+ixx,:) ==
ones(length(ixx),1)*data(i,:))'))+i;
    ix_del = [ix_del ixxx];
endfor

ix_del = ix(ix_del);

endfunction



-----------------------------------------------------------------------
Octave is freely available under the terms of the GNU GPL.

Octave's home on the web:  http://www.che.wisc.edu/octave/octave.html
How to fund new projects:  http://www.che.wisc.edu/octave/funding.html
Subscription information:  http://www.che.wisc.edu/octave/archive.html
-----------------------------------------------------------------------


Reply | Threaded
Open this post in threaded view
|

Re: removing replicated rows from a (large) matrix

Paul Kienzle-5
From: Etienne Grossmann <[hidden email]>
>  For the same effect, you can use uniq(), from my "misc" toolbox, after passing
>the data to sortrows (Paul Kienzle's) or lexicosort() ("misc" toolbox).
>
>  Etienne
>
>ps : I think unique() is used as a synonym to create_set() (or almost)

In matlab, unique(A, 'rows') does what you expect.  My unique() needs to
be extended to handle this case.  Feel free to send me updates.  Sorry
it is still only available as part of a tarball, but we are working on it.

        http://users.powernet.co.uk/kienzle/matcompat.tar.gz

>
>
>
>-----------------------------------------------------------------------
>Octave is freely available under the terms of the GNU GPL.
>
>Octave's home on the web:  http://www.che.wisc.edu/octave/octave.html
>How to fund new projects:  http://www.che.wisc.edu/octave/funding.html
>Subscription information:  http://www.che.wisc.edu/octave/archive.html
>-----------------------------------------------------------------------
>
>
>



-----------------------------------------------------------------------
Octave is freely available under the terms of the GNU GPL.

Octave's home on the web:  http://www.che.wisc.edu/octave/octave.html
How to fund new projects:  http://www.che.wisc.edu/octave/funding.html
Subscription information:  http://www.che.wisc.edu/octave/archive.html
-----------------------------------------------------------------------


Reply | Threaded
Open this post in threaded view
|

Re: removing replicated rows from a (large) matrix

Etienne Grossmann-4
In reply to this post by Joao Cardoso-3
  Hello,

  are you sure it works? I renamed it joao_uniq, and it gives :

octave:144> x(1:15,:)' , x(joao_uniq (x(1:15,:)),:)'
ans =

  0  0  0  0  0  0  0  0  0  0  1  1  1  1  1
  1  1  1  2  3  4  5  6  9  9  0  0  4  5  5

ans =

  0  0  0  1  1
  1  1  9  0  5

  isn't there a problem?

  For the same effect, you can use uniq(), from my "misc" toolbox, after passing
the data to sortrows (Paul Kienzle's) or lexicosort() ("misc" toolbox).

  Etienne

ps : I think unique() is used as a synonym to create_set() (or almost)



-----------------------------------------------------------------------
Octave is freely available under the terms of the GNU GPL.

Octave's home on the web:  http://www.che.wisc.edu/octave/octave.html
How to fund new projects:  http://www.che.wisc.edu/octave/funding.html
Subscription information:  http://www.che.wisc.edu/octave/archive.html
-----------------------------------------------------------------------


Reply | Threaded
Open this post in threaded view
|

Re: removing replicated rows from a (large) matrix

Joao Cardoso-3
Etienne Grossmann wrote:

>
>   Hello,
>
>   are you sure it works? I renamed it joao_uniq, and it gives :
>
> octave:144> x(1:15,:)' , x(joao_uniq (x(1:15,:)),:)'
> ans =
>
>   0  0  0  0  0  0  0  0  0  0  1  1  1  1  1
>   1  1  1  2  3  4  5  6  9  9  0  0  4  5  5
>
> ans =
>
>   0  0  0  1  1
>   1  1  9  0  5
>
>   isn't there a problem?

no, those are the elements to remove. As the help says,

   Executing `data(unique(data),:) = []', will leave `data' with unique
rows,
                                   ^^^^

I agree, it should not be called 'unique' but 'repeated indexes'.

But your uniq(lexicosort(A)) is faster, and give the same results.
Thanks

Joao


>
>   For the same effect, you can use uniq(), from my "misc" toolbox, after passing
> the data to sortrows (Paul Kienzle's) or lexicosort() ("misc" toolbox).
>
>   Etienne
>
> ps : I think unique() is used as a synonym to create_set() (or almost)



-----------------------------------------------------------------------
Octave is freely available under the terms of the GNU GPL.

Octave's home on the web:  http://www.che.wisc.edu/octave/octave.html
How to fund new projects:  http://www.che.wisc.edu/octave/funding.html
Subscription information:  http://www.che.wisc.edu/octave/archive.html
-----------------------------------------------------------------------


Reply | Threaded
Open this post in threaded view
|

Re: removing replicated rows from a (large) matrix

Etienne Grossmann-3
In reply to this post by Joao Cardoso-3


  Hello,

  thanks for pointing out the bug, Joao.

  The good news is that I fixed the bug, basically by copying Paul
Kienzle and Daniel Calvelo's sortrows() code into lexicosort(). The
main difference between the two is that lexicosort returns the sorted
indices too.

  The bad news is that there was that bug for real-valued data, all
this time ...

  Fixed lex_cmp() and test_lexicosort too, and updated the toolbox in
which they live :

  <a href="http://anonimo.isr.ist.utl.pt/Šþetienne/octave/misc.2000.08.06.tar.gz">http://anonimo.isr.ist.utl.pt/Šþetienne/octave/misc.2000.08.06.tar.gz

  Etienne

======================================================================
Hi,

No free lunch :-)

Your lexicosort() was so fast that I decided to investigate it. It looks
that it only work for integers... you should make it clear in the help,
although 'lexicographicaly' somehow implies discrete values.

The example ilustrates the problem.

octave:238> [b, lexicosort (b)]
ans =

  0.097587  0.819460  0.142899  0.096627
  0.656920  0.779910  0.204675  0.582550
  0.479985  0.728203  0.097587  0.819460
  0.792401  0.302763  0.544916  0.112166
  0.142899  0.096627  0.479985  0.728203
  0.978989  0.806111  0.792401  0.302763
  0.204675  0.582550  0.915432  0.289305
  0.821141  0.502010  0.656920  0.779910
  0.915432  0.289305  0.821141  0.502010
  0.544916  0.112166  0.978989  0.806111

My 'unique' was made for reals. When I wrote it I thought in a similar
approach to yours, considering each column to be a 'digit' of a number,
but I abandoned tha idea.

Thanks,
Joao



-----------------------------------------------------------------------
Octave is freely available under the terms of the GNU GPL.

Octave's home on the web:  http://www.che.wisc.edu/octave/octave.html
How to fund new projects:  http://www.che.wisc.edu/octave/funding.html
Subscription information:  http://www.che.wisc.edu/octave/archive.html
-----------------------------------------------------------------------


Reply | Threaded
Open this post in threaded view
|

Re: removing replicated rows from a (large) matrix

Etienne Grossmann-4
In reply to this post by Joao Cardoso-3
[hidden email]
> main difference between the two is that lexicosort returns the sorted
> indices too.

  And sortrow() accepts a second argument to choose in which order the
columns should be sorted.

  Etienne



-----------------------------------------------------------------------
Octave is freely available under the terms of the GNU GPL.

Octave's home on the web:  http://www.che.wisc.edu/octave/octave.html
How to fund new projects:  http://www.che.wisc.edu/octave/funding.html
Subscription information:  http://www.che.wisc.edu/octave/archive.html
-----------------------------------------------------------------------