My vectorized code is slower than a loop

classic Classic list List threaded Threaded
11 messages Options
Reply | Threaded
Open this post in threaded view
|

My vectorized code is slower than a loop

Dave Cottingham 2
In the process of writing a routine to compute the Qn statistic of Croux and Rousseeuw, I started from the Fortran code they published in their paper (C&R, "Time-efficient algorithms for two highly robust estimators of scale") and ran into unexpected trouble with one section. Here's a fairly literal translation of the original, where x is an array of length n, and left and right are also arrays of length n that contain integers:

w = zeros(1, n);
j = 1
for i = 1:n
  if(left(i) <= right(i))
    for jj = left(i):right(i)
      w(j) = x(i) - x(n+1-jj);
      j += 1;
    endfor
  endif
endfor

Before we get to this section, left and right are such that at most n items will get copied into w.

Not surprisingly, this code is pretty slow. I see no way to eliminate the outside loop on i; but getting rid of the inside loop on jj is straightforward:

w = zeros(1, n);
j = 1
for i = 1:n
  if(left(i) <= right(i))
    w(j:j+right(i)-left(i)) = x(i) - x(n+1-[left(i):right(i)]);
    j = j + right(i)-left(i)+1;
  endif
endfor

Problem is, the second, vectorized, code takes twice as long as the first, non-vectorized, code.

Anybody have an idea why? I'm stumped.

I have attached a script that contains these two snippets, some setup, and tics and tocs, for anyone who wants to play with it.

Thanks,
Dave Cottingham

looptest.m
Reply | Threaded
Open this post in threaded view
|

Re: My vectorized code is slower than a loop

nrjank
On Mon, Oct 17, 2016 at 3:37 PM, Dave Cottingham 2
<[hidden email]> wrote:

> In the process of writing a routine to compute the Qn statistic of Croux and
> Rousseeuw, I started from the Fortran code they published in their paper
> (C&R, "Time-efficient algorithms for two highly robust estimators of scale")
> and ran into unexpected trouble with one section. Here's a fairly literal
> translation of the original, where x is an array of length n, and left and
> right are also arrays of length n that contain integers:
>
> w = zeros(1, n);
> j = 1
> for i = 1:n
>   if(left(i) <= right(i))
>     for jj = left(i):right(i)
>       w(j) = x(i) - x(n+1-jj);
>       j += 1;
>     endfor
>   endif
> endfor
>
> Before we get to this section, left and right are such that at most n items
> will get copied into w.
>
> Not surprisingly, this code is pretty slow. I see no way to eliminate the
> outside loop on i; but getting rid of the inside loop on jj is
> straightforward:
>
> w = zeros(1, n);
> j = 1
> for i = 1:n
>   if(left(i) <= right(i))
>     w(j:j+right(i)-left(i)) = x(i) - x(n+1-[left(i):right(i)]);
>     j = j + right(i)-left(i)+1;
>   endif
> endfor
>
> Problem is, the second, vectorized, code takes twice as long as the first,
> non-vectorized, code.
>
> Anybody have an idea why? I'm stumped.
>
> I have attached a script that contains these two snippets, some setup, and
> tics and tocs, for anyone who wants to play with it.
>
> Thanks,
> Dave Cottingham


Don't have time to test at the moment, but have you tried using the
profiler? Octave's profiler isn't as thorough as Matlab's, but it
should be able to show you where your scripts are spending the most
time, and maybe why the latter is slower.

profile on;

*run script*

profile off

profshow

and/or

profexplore

_______________________________________________
Help-octave mailing list
[hidden email]
https://lists.gnu.org/mailman/listinfo/help-octave
Reply | Threaded
Open this post in threaded view
|

Re: My vectorized code is slower than a loop

Juan Pablo Carbajal-2
On Mon, Oct 17, 2016 at 10:34 PM, Nicholas Jankowski
<[hidden email]> wrote:

> On Mon, Oct 17, 2016 at 3:37 PM, Dave Cottingham 2
> <[hidden email]> wrote:
>> In the process of writing a routine to compute the Qn statistic of Croux and
>> Rousseeuw, I started from the Fortran code they published in their paper
>> (C&R, "Time-efficient algorithms for two highly robust estimators of scale")
>> and ran into unexpected trouble with one section. Here's a fairly literal
>> translation of the original, where x is an array of length n, and left and
>> right are also arrays of length n that contain integers:
>>
>> w = zeros(1, n);
>> j = 1
>> for i = 1:n
>>   if(left(i) <= right(i))
>>     for jj = left(i):right(i)
>>       w(j) = x(i) - x(n+1-jj);
>>       j += 1;
>>     endfor
>>   endif
>> endfor
>>
>> Before we get to this section, left and right are such that at most n items
>> will get copied into w.
>>
>> Not surprisingly, this code is pretty slow. I see no way to eliminate the
>> outside loop on i; but getting rid of the inside loop on jj is
>> straightforward:
>>
>> w = zeros(1, n);
>> j = 1
>> for i = 1:n
>>   if(left(i) <= right(i))
>>     w(j:j+right(i)-left(i)) = x(i) - x(n+1-[left(i):right(i)]);
>>     j = j + right(i)-left(i)+1;
>>   endif
>> endfor
>>
>> Problem is, the second, vectorized, code takes twice as long as the first,
>> non-vectorized, code.
>>
>> Anybody have an idea why? I'm stumped.
>>
>> I have attached a script that contains these two snippets, some setup, and
>> tics and tocs, for anyone who wants to play with it.
>>
>> Thanks,
>> Dave Cottingham
>
>
> Don't have time to test at the moment, but have you tried using the
> profiler? Octave's profiler isn't as thorough as Matlab's, but it
> should be able to show you where your scripts are spending the most
> time, and maybe why the latter is slower.
>
> profile on;
>
> *run script*
>
> profile off
>
> profshow
>
> and/or
>
> profexplore
>
> _______________________________________________
> Help-octave mailing list
> [hidden email]
> https://lists.gnu.org/mailman/listinfo/help-octave

Hi,

If you provide code we can directly test or study it will make things easier!
Can you provide a working case?

_______________________________________________
Help-octave mailing list
[hidden email]
https://lists.gnu.org/mailman/listinfo/help-octave
Reply | Threaded
Open this post in threaded view
|

Re: My vectorized code is slower than a loop

Dave Cottingham 2


On Tue, Oct 18, 2016 at 11:43 AM, Juan Pablo Carbajal <[hidden email]> wrote:
On Mon, Oct 17, 2016 at 10:34 PM, Nicholas Jankowski
<[hidden email]> wrote:
> On Mon, Oct 17, 2016 at 3:37 PM, Dave Cottingham 2
> <[hidden email]> wrote:
>> In the process of writing a routine to compute the Qn statistic of Croux and
>> Rousseeuw, I started from the Fortran code they published in their paper
>> (C&R, "Time-efficient algorithms for two highly robust estimators of scale")
>> and ran into unexpected trouble with one section. Here's a fairly literal
>> translation of the original, where x is an array of length n, and left and
>> right are also arrays of length n that contain integers:
>>
>> w = zeros(1, n);
>> j = 1
>> for i = 1:n
>>   if(left(i) <= right(i))
>>     for jj = left(i):right(i)
>>       w(j) = x(i) - x(n+1-jj);
>>       j += 1;
>>     endfor
>>   endif
>> endfor
>>
>> Before we get to this section, left and right are such that at most n items
>> will get copied into w.
>>
>> Not surprisingly, this code is pretty slow. I see no way to eliminate the
>> outside loop on i; but getting rid of the inside loop on jj is
>> straightforward:
>>
>> w = zeros(1, n);
>> j = 1
>> for i = 1:n
>>   if(left(i) <= right(i))
>>     w(j:j+right(i)-left(i)) = x(i) - x(n+1-[left(i):right(i)]);
>>     j = j + right(i)-left(i)+1;
>>   endif
>> endfor
>>
>> Problem is, the second, vectorized, code takes twice as long as the first,
>> non-vectorized, code.
>>
>> Anybody have an idea why? I'm stumped.
>>
>> I have attached a script that contains these two snippets, some setup, and
>> tics and tocs, for anyone who wants to play with it.
>>
>> Thanks,
>> Dave Cottingham
>
>
> Don't have time to test at the moment, but have you tried using the
> profiler? Octave's profiler isn't as thorough as Matlab's, but it
> should be able to show you where your scripts are spending the most
> time, and maybe why the latter is slower.
>
> profile on;
>
> *run script*
>
> profile off
>
> profshow
>
> and/or
>
> profexplore
>
> _______________________________________________
> Help-octave mailing list
> [hidden email]
> https://lists.gnu.org/mailman/listinfo/help-octave

Hi,

If you provide code we can directly test or study it will make things easier!
Can you provide a working case?


Was in my original post -- here it is again:

http://octave.1599824.n4.nabble.com/file/n4680192/looptest.m

Thanks,
Dave Cottingham



_______________________________________________
Help-octave mailing list
[hidden email]
https://lists.gnu.org/mailman/listinfo/help-octave
Reply | Threaded
Open this post in threaded view
|

Re: My vectorized code is slower than a loop

Andreas Weber-6
In reply to this post by Dave Cottingham 2
Am 17.10.2016 um 21:37 schrieb Dave Cottingham 2:
> In the process of writing a routine to compute the Qn statistic of Croux and
> Rousseeuw, I started from the Fortran code they published in their paper
> (C&R, "Time-efficient algorithms for two highly robust estimators of scale")
> and ran into unexpected trouble with one section. Here's a fairly literal
> translation of the original, where x is an array of length n, and left and
> right are also arrays of length n that contain integers:
>
> code snipped

Your script can be simplified to

w = x-x(end:-1:1);

try it out!

Since "if(left(i) <= right(i))" is always true. Perhaps you can provide
a realistic example?

-- Andy

_______________________________________________
Help-octave mailing list
[hidden email]
https://lists.gnu.org/mailman/listinfo/help-octave
Reply | Threaded
Open this post in threaded view
|

Re: My vectorized code is slower than a loop

nrjank
On Tue, Oct 18, 2016 at 1:38 PM, Andreas Weber <[hidden email]> wrote:

> Am 17.10.2016 um 21:37 schrieb Dave Cottingham 2:
>> In the process of writing a routine to compute the Qn statistic of Croux and
>> Rousseeuw, I started from the Fortran code they published in their paper
>> (C&R, "Time-efficient algorithms for two highly robust estimators of scale")
>> and ran into unexpected trouble with one section. Here's a fairly literal
>> translation of the original, where x is an array of length n, and left and
>> right are also arrays of length n that contain integers:
>>
>> code snipped
>
> Your script can be simplified to
>
> w = x-x(end:-1:1);
>
> try it out!
>
> Since "if(left(i) <= right(i))" is always true. Perhaps you can provide
> a realistic example?
>
> -- Andy


Andy's got it, but for reference here's the profiler timing for your
code.  Andy's code appears to be too fast for the profiler :) Note the
additional operation count in your vectorized code.


vectorized
Elapsed time is 5.917 seconds.

   #            Function Attr     Time (s)   Time (%)        Calls
------------------------------------------------------------------
   5            binary -             0.131      43.52       400000
   4            binary +             0.121      40.20       400000
   3           binary <=             0.049      16.28       100000
   1                 tic             0.000       0.00            1
   2               zeros             0.000       0.00            1
   6                 toc             0.000       0.00            1
   7             profile             0.000       0.00            1
   8              nargin             0.000       0.00            1
   9           binary !=             0.000       0.00            1
  10               false             0.000       0.00            1
  11 __profiler_enable__             0.000       0.00            1

non-vectorized
Elapsed time is 3.594 seconds.

   #            Function Attr     Time (s)   Time (%)        Calls
------------------------------------------------------------------
   5            binary -             0.053      47.32       200000
   4            binary +             0.030      26.79       100000
   3           binary <=             0.029      25.89       100000
   1                 tic             0.000       0.00            1
   2               zeros             0.000       0.00            1
   6                 toc             0.000       0.00            1
   7             profile             0.000       0.00            1
   8              nargin             0.000       0.00            1
   9           binary !=             0.000       0.00            1
  10               false             0.000       0.00            1
  11 __profiler_enable__             0.000       0.00            1

Andy supercode
Elapsed time is 0.039 seconds.


   #            Function Attr     Time (s)   Time (%)        Calls
------------------------------------------------------------------
warning: division by zero
warning: called from
    profshow at line 85 column 17
    testy at line 59 column 3
   1                 tic             0.000        NaN            1
warning: division by zero
   2                 end             0.000        NaN            1
warning: division by zero
   3            prefix -             0.000        NaN            1
warning: division by zero
   4            binary -             0.000        NaN            1
warning: division by zero
   5                 toc             0.000        NaN            1
warning: division by zero
   6             profile             0.000        NaN            1
warning: division by zero
   7              nargin             0.000        NaN            1
warning: division by zero
   8           binary !=             0.000        NaN            1
warning: division by zero
   9               false             0.000        NaN            1
warning: division by zero
  10 __profiler_enable__             0.000        NaN            1
>>

_______________________________________________
Help-octave mailing list
[hidden email]
https://lists.gnu.org/mailman/listinfo/help-octave
Reply | Threaded
Open this post in threaded view
|

Re: My vectorized code is slower than a loop

Dave Cottingham 2
In reply to this post by Andreas Weber-6
Andreas Weber-6 wrote
Your script can be simplified to

w = x-x(end:-1:1);

try it out!

Since "if(left(i) <= right(i))" is always true. Perhaps you can provide
a realistic example?

-- Andy

_______________________________________________
Help-octave mailing list
[hidden email]
https://lists.gnu.org/mailman/listinfo/help-octave
I try to keep the examples short -- and the computation of a realistic example is a bit involved -- so rather than post the code to compute the example, here's an octave saveset with a realistic example, and a modified "looptest.m" that reads it and runs the timing comparison.

Thanks,
Dave Cottingham

looptest.dat
looptest.m
Reply | Threaded
Open this post in threaded view
|

Re: My vectorized code is slower than a loop

nrjank
On Wed, Oct 19, 2016 at 10:26 AM, Dave Cottingham 2
<[hidden email]> wrote:

> Andreas Weber-6 wrote
>> Your script can be simplified to
>>
>> w = x-x(end:-1:1);
>>
>> try it out!
>>
>> Since "if(left(i) <= right(i))" is always true. Perhaps you can provide
>> a realistic example?
>>
>> -- Andy
>>
>> _______________________________________________
>> Help-octave mailing list
>
>> Help-octave@
>
>> https://lists.gnu.org/mailman/listinfo/help-octave
>
> I try to keep the examples short -- and the computation of a realistic
> example is a bit involved -- so rather than post the code to compute the
> example, here's an octave saveset with a realistic example, and a modified
> "looptest.m" that reads it and runs the timing comparison.
>
> Thanks,
> Dave Cottingham
>
> looptest.dat
> <http://octave.1599824.n4.nabble.com/file/n4680245/looptest.dat>
> looptest.m <http://octave.1599824.n4.nabble.com/file/n4680245/looptest.m>
>


Yes, so Andy's code no longer produces an equivalent w.  Also, I
didn't look to see if you changed any of the code, but there is now
less of a penalty for the vectorized function.

Again, use the profiler. it will show you where you're code is
spending time.  Your vectorized code is simply making almost twice as
many binary operation calls (500,000 vs 325000), eating up any of the
non-looping benefit. here's the summary:

vectorized
Elapsed time is 6.26725 seconds.
   #            Function Attr     Time (s)   Time (%)        Calls
------------------------------------------------------------------
   4            binary -             0.122      41.92       199956
   3            binary +             0.111      38.15       199956
   2           binary <=             0.058      19.93       100000
   1                 tic             0.000       0.00            1
   5                 toc             0.000       0.00            1
   6             profile             0.000       0.00            1
   7              nargin             0.000       0.00            1
   8           binary !=             0.000       0.00            1
   9               false             0.000       0.00            1
  10 __profiler_enable__             0.000       0.00            1

looped
j =  1
Elapsed time is 5.94019 seconds.
   #            Function Attr     Time (s)   Time (%)        Calls
------------------------------------------------------------------
   4            binary -             0.084      40.38       150344
   2           binary <=             0.062      29.81       100000
   3            binary +             0.062      29.81        75172
   1                 tic             0.000       0.00            1
   5                 toc             0.000       0.00            1
   6             profile             0.000       0.00            1
   7              nargin             0.000       0.00            1
   8           binary !=             0.000       0.00            1
   9               false             0.000       0.00            1
  10 __profiler_enable__             0.000       0.00            1

_______________________________________________
Help-octave mailing list
[hidden email]
https://lists.gnu.org/mailman/listinfo/help-octave
Reply | Threaded
Open this post in threaded view
|

Re: My vectorized code is slower than a loop

Francesco Potortì
In reply to this post by Dave Cottingham 2
No profiling, no runs, just looking at the code, so I may be off.

w = zeros(1, n);
j = 1
for i = 1:n
  if (vectorised)
    if(left(i) <= right(i))
      w(j:j+right(i)-left(i)) = x(i) - x(n+1-[left(i):right(i)]);
      j = j + right(i)-left(i)+1;
    endif
  else # looping
    if(left(i) <= right(i))
      for jj = left(i):right(i)
        w(j) = x(i) - x(n+1-jj);
        j += 1;
      endfor
  endif
endfor

In the loop version, the if is useless, because the for range will be
empty, so the for will not run when the if clause is false.  This may
result in a speedup or not.

In the vectorised version, I would write it like this, which minimises
the number of operations at the expense of the number of instructions,
which may result in a speedup or not:

   li = left(i);
   d = right(i) - li;
   if (d >= 0)
     dr = 0:d;
     w(j+dr) = x(i) - x(n+1-li+dr);
     j += d+1;
   endif

--
Francesco Potortì (ricercatore)        Voice:  +39.050.621.3058
ISTI - Area della ricerca CNR          Mobile: +39.348.8283.107
via G. Moruzzi 1, I-56124 Pisa         Skype:  wnlabisti
(entrance 20, 1st floor, room C71)     Web:    http://fly.isti.cnr.it


_______________________________________________
Help-octave mailing list
[hidden email]
https://lists.gnu.org/mailman/listinfo/help-octave
Reply | Threaded
Open this post in threaded view
|

Re: My vectorized code is slower than a loop

Dave Cottingham 2
I had come to a similar conclusion, from looking at (and reproducing) Nicholas' timing results. In my original vectorized version, the operations spent computing the index ranges are more than what's saved by vectorizing. I did some precomputing of the ranges, similar to yours, and managed to get the vectorized version no slower than the loop.

As dismaying as that is, I think I see why: at each i, the number of things copied is a few or none, so there's nothing to be gained by vectorizing. This code fragment can only be sped up by eliminating the loop on i -- and I see no way to do that.

My conclusion is there's no alternative to implementing this is c++ (which I've already done).

Thanks,
Dave Cottingham


On Wed, Oct 19, 2016 at 4:19 PM, Francesco Potortì <[hidden email]> wrote:
No profiling, no runs, just looking at the code, so I may be off.

w = zeros(1, n);
j = 1
for i = 1:n
  if (vectorised)
    if(left(i) <= right(i))
      w(j:j+right(i)-left(i)) = x(i) - x(n+1-[left(i):right(i)]);
      j = j + right(i)-left(i)+1;
    endif
  else                          # looping
    if(left(i) <= right(i))
      for jj = left(i):right(i)
        w(j) = x(i) - x(n+1-jj);
        j += 1;
      endfor
  endif
endfor

In the loop version, the if is useless, because the for range will be
empty, so the for will not run when the if clause is false.  This may
result in a speedup or not.

In the vectorised version, I would write it like this, which minimises
the number of operations at the expense of the number of instructions,
which may result in a speedup or not:

   li = left(i);
   d = right(i) - li;
   if (d >= 0)
     dr = 0:d;
     w(j+dr) = x(i) - x(n+1-li+dr);
     j += d+1;
   endif

--
Francesco Potortì (ricercatore)        Voice:  <a href="tel:%2B39.050.621.3058" value="+390506213058">+39.050.621.3058
ISTI - Area della ricerca CNR          Mobile: <a href="tel:%2B39.348.8283.107" value="+393488283107">+39.348.8283.107
via G. Moruzzi 1, I-56124 Pisa         Skype:  wnlabisti
(entrance 20, 1st floor, room C71)     Web:    http://fly.isti.cnr.it



_______________________________________________
Help-octave mailing list
[hidden email]
https://lists.gnu.org/mailman/listinfo/help-octave
Reply | Threaded
Open this post in threaded view
|

Re: My vectorized code is slower than a loop

nrjank
On Thu, Oct 20, 2016 at 10:50 AM, Dave Cottingham
<[hidden email]> wrote:
>
> My conclusion is there's no alternative to implementing this is c++ (which
> I've already done).
>

You may find it worth posting the looped code and your most recent attempt over at StackOverflow.com and seeing if anyone over there can vectorize it more efficiently. Some people monitoring Matlab/Octave posts there are pretty sharp on that subject.

nickj

_______________________________________________
Help-octave mailing list
[hidden email]
https://lists.gnu.org/mailman/listinfo/help-octave