Vectorizing strcmp

classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

Vectorizing strcmp

Keith Goodman
On the one hand, it's nice that strcmp is a user-defined
function---it's easy to look at the code to see what it does. On the
other hand, the reason I'm looking at the code is that it is slow.

I'm using strcmp(s1,s2) where s1 is a cell array of strings (around
1000X1) and s2 is a string.

Currently, in 2.1.58, strcmp does

n = length(t1);
for i = 1:n
   retval(i,:) = strcmp (t1{i}, s2);
endfor

which recursively calls strcmp for two strings and where t1 is s1(:).
The for loop is slow.

Short of making strcmp a built-in function, one way to speed it up is
to replace the for loop with a repmat like this

c1 = char(t1);
[rc1, cc1] = size(c1);
ns2 = size(s2,2);
blank = ' ';
ms2 = repmat([s2 repmat(blank,1,max(cc1 - ns2, 0))], rc1, 1);
mc1 = [c1 repmat(blank,1,max(ns2 - cc1, 0))];
retval2 = sum(mc1==ms2, 2)==size(mc1, 2);

The blank is to pad the strings if char(s1) and s2 are different widths.

The second code fragment is about 10 times faster than the first for
length(s1)==1000. For length(s1)==10 the speed is about the same. And
for length(s1) < 10 the first approach is faster.

For me the added complexity in the code is more than offset by the
gain in speed.


Reply | Threaded
Open this post in threaded view
|

Vectorizing strcmp

John W. Eaton-6
On 15-Sep-2004, Keith Goodman <[hidden email]> wrote:

| On the one hand, it's nice that strcmp is a user-defined
| function---it's easy to look at the code to see what it does.

Perhaps it is easier because it is a user-defined function, but with
Octave, unlike the other leading brand, you can always look at all of
the source code for any of the functions.

| On the other hand, the reason I'm looking at the code is that it is slow.
|
| I'm using strcmp(s1,s2) where s1 is a cell array of strings (around
| 1000X1) and s2 is a string.
|
| Currently, in 2.1.58, strcmp does
|
| n = length(t1);
| for i = 1:n
|    retval(i,:) = strcmp (t1{i}, s2);
| endfor
|
| which recursively calls strcmp for two strings and where t1 is s1(:).
| The for loop is slow.
|
| Short of making strcmp a built-in function, one way to speed it up is
| to replace the for loop with a repmat like this
|
| c1 = char(t1);
| [rc1, cc1] = size(c1);
| ns2 = size(s2,2);
| blank = ' ';
| ms2 = repmat([s2 repmat(blank,1,max(cc1 - ns2, 0))], rc1, 1);
| mc1 = [c1 repmat(blank,1,max(ns2 - cc1, 0))];

| retval2 = sum(mc1==ms2, 2)==size(mc1, 2);

Wouldn't

  all (mc1 == ms2, 2)

be better here?

| The second code fragment is about 10 times faster than the first for
| length(s1)==1000. For length(s1)==10 the speed is about the same. And
| for length(s1) < 10 the first approach is faster.
|
| For me the added complexity in the code is more than offset by the
| gain in speed.

I think we need to be careful about using repmat in this way.  If your
cell array is large and the comparison string is fairly long, it could
easily generate a matrix many megabytes in size.  Following that with
code that generates additional temporary arrays could cause trouble.
Maybe it could be done in blocks rather than all at once?

Or, perhaps it should be a builtin function.

jwe


Reply | Threaded
Open this post in threaded view
|

Re: Vectorizing strcmp

Keith Goodman
Good point. How about this:

The for loop is part of the problem. Another part is that the
recursive call to strcmp needs three or four if statements to
determine that we are asking for a strcmp between two strings. But we
already know that they are strings since we are creating the strings
from the cell.

So we could replace

for i = 1:n
   retval(i,:) = strcmp (t1{i}, s2);
endfor

with

for i = 1:n
   retval(i,:) = isequal (t1 {i}, s2);
endfor

for a small but easy speed up.



On Wed, 15 Sep 2004 13:18:13 -0400, John W. Eaton <[hidden email]> wrote:

> On 15-Sep-2004, Keith Goodman <[hidden email]> wrote:
>
> | On the one hand, it's nice that strcmp is a user-defined
> | function---it's easy to look at the code to see what it does.
>
> Perhaps it is easier because it is a user-defined function, but with
> Octave, unlike the other leading brand, you can always look at all of
> the source code for any of the functions.
>
> | On the other hand, the reason I'm looking at the code is that it is slow.
> |
> | I'm using strcmp(s1,s2) where s1 is a cell array of strings (around
> | 1000X1) and s2 is a string.
> |
> | Currently, in 2.1.58, strcmp does
> |
> | n = length(t1);
> | for i = 1:n
> |    retval(i,:) = strcmp (t1{i}, s2);
> | endfor
> |
> | which recursively calls strcmp for two strings and where t1 is s1(:).
> | The for loop is slow.
> |
> | Short of making strcmp a built-in function, one way to speed it up is
> | to replace the for loop with a repmat like this
> |
> | c1 = char(t1);
> | [rc1, cc1] = size(c1);
> | ns2 = size(s2,2);
> | blank = ' ';
> | ms2 = repmat([s2 repmat(blank,1,max(cc1 - ns2, 0))], rc1, 1);
> | mc1 = [c1 repmat(blank,1,max(ns2 - cc1, 0))];
>
> | retval2 = sum(mc1==ms2, 2)==size(mc1, 2);
>
> Wouldn't
>
>   all (mc1 == ms2, 2)
>
> be better here?
>
> | The second code fragment is about 10 times faster than the first for
> | length(s1)==1000. For length(s1)==10 the speed is about the same. And
> | for length(s1) < 10 the first approach is faster.
> |
> | For me the added complexity in the code is more than offset by the
> | gain in speed.
>
> I think we need to be careful about using repmat in this way.  If your
> cell array is large and the comparison string is fairly long, it could
> easily generate a matrix many megabytes in size.  Following that with
> code that generates additional temporary arrays could cause trouble.
> Maybe it could be done in blocks rather than all at once?
>
> Or, perhaps it should be a builtin function.
>
> jwe
>
>