On Wed, Dec 19, 2018 at 11:09:51AM -0800, Rik wrote:

> Recently, sliding-window statistics functions (movXXX) were added to
> Octave. All of these functions depend on a base function, movfun. As
> part of implementing movfun I used the profiler to examine the
> performance, and I found a hot spot responsible for 70% of the run time.
> Unfortunately, I couldn't see any way to improve things without jumping
> out of the Octave language into C++. I'm hoping someone else might look
> at this and have an idea.
>
> The code is below:
>
> --- Code Start ---
> ## Apply "shrink" boundary conditions.
> ## The function is not applied to any window elements outside the
> ## original data.
> function y = shrink_bc (fcn, x, idxp, win, wlen, odim)
>   N = length (x);
>   idx = idxp + win;
>   tf = (idx > 0) & (idx <= N);  # idx inside boundaries
>
>   n = length (idxp);
>   y = zeros (n, odim);
>   ## FIXME: This nested for loop accounts for 70% of running time.
>   ##        Given that "shrink" is the default Endpoint value, this
>   ##        code needs to be reworked.
>   for i = 1:n
>     k = idx(tf(:,i),i);
>     y(i,:) = fcn (x(k));
>   endfor
> endfunction
> --- Code End ---
>
> In my case, the following conditions are a suitable example:
>
> +verbatim+
> fcn = @mean;
> x = randi (1e4, [1000, 1]);
> idxp = 1:25;
> win = (-25:25).';
> wlen = [51, 51];
> odim = 1;
> -verbatim-
>
> Indexing in Octave is insanely fast, but this is a weird case because it
> is not simply linear indexing [x(5)], nor is it logical indexing
> [x(x < 1)]. It is a hybrid where the indices are linear (as would be
> returned from find()), but only some of them are valid, so they need to
> be masked with a logical value.
>
> When I run this code repeatedly in a benchmark it requires ~3.0
> milliseconds, but in the overall code it is called 2000 times, so the
> function is taking ~6 seconds, which feels long.
>
> In this example, the value of k for i = 1 is 1:26; for i = 2, 1:27; and
> so on.
>
> I tried recoding using cellfun:
>
> idx = cellfun (@(n) (1:n), num2cell ((wlen(1) - idxp(end)):(wlen(end)-1)),
>                'uniformoutput', false);
> y = cellfun (@(i) fcn (x(i)), idx);
>
> but this takes ~4.5 milliseconds to run.
>
> Anybody have any good tricks to try?
>
> If not, it makes me wonder whether a C++ function that did this mixed
> indexing could be useful. Another feature which is not yet implemented
> for movfun is "includenan"/"omitnan". The core need is similar to the
> above in that I have a function handle, some linear indices, and some
> logical indices (based on isnan()).
>
> --Rik

Two ideas. First, if this movfun is not intended to be called with
arbitrary user-supplied functions, but only with mean, median, std, sum,
var, max, and min to provide the mov* functions introduced in Matlab
R2016a (the first two comments on
https://blogs.mathworks.com/loren/2016/04/15/moving-along/ seem to imply
that Matlab does not provide movfun), then I would change the
architecture of movfun to do common preprocessing, followed by a switch
statement which, in the case of shrink boundary conditions, expands the
data by

+verbatim+
data_ = [fill*ones(wlen,1); data; fill*ones(wlen,1)];
dat = data_(wlen + (1:N) + (-wlen:wlen)');  # (2*wlen+1) x N, one window per column
-verbatim-

where wlen is the half-width of the window and fill is -inf for max,
+inf for min, and zero for all else (apart from median). Then use the
fact that you can feed dat to all of the functions above and specify
that you want to do your operation along the first dimension (which is
the default anyway). For mean you have to renormalize the first and last
wlen entries to undo the fact that zeros were averaged in, and for std
and var you have to do this in two rounds: first compute the mean of the
values as indicated above, then sum the deviations. mad and prod, which
are not provided by Matlab, can obviously also be done in this way. For
median, it seems that you will need a for loop for the first and last
wlen entries (because there is no appropriate fill apart from NaN), but
N will typically be much larger than wlen in any case. I am quite sure
that for not-too-large wlen (say 10) this will be orders of magnitude
faster than the present implementation.

Note that Matlab accepts a nanflag argument for all of these functions
(i.e., sum([1 NaN],'omitnan') should give 1), which Octave does not yet
support -- as of now an unreported(?) Matlab incompatibility bug. If
Octave also provided this argument, the logic would become much simpler,
because you could always use NaN as the filler. Further, for NaN
handling in mov* you also have to replace every NaN by fill, and
normalize it out in the case of mean.
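
For concreteness, here is a minimal sketch of that padding approach for
a centered moving mean, including the renormalization of the boundary
entries (variable names are illustrative, not movfun's; hw is the window
half-width, i.e. the window is 2*hw+1 wide):

+verbatim+
x  = (1:10).';                       # sample data
N  = length (x);
hw = 2;                              # half-width; window is 2*hw+1 = 5 wide
xp = [zeros(hw,1); x; zeros(hw,1)];  # pad with the fill value (0 for mean/sum)
dat = xp(hw + (1:N) + (-hw:hw).');   # (2*hw+1) x N, one window per column
## Near the ends fewer than 2*hw+1 real samples were summed, so divide by
## the number of in-range samples rather than the full window width.
nvalid = min ((1:N) + hw, N) - max ((1:N) - hw, 1) + 1;
y = sum (dat, 1) ./ nvalid;          # matches the "shrink" moving mean
-verbatim-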

If you really want to be fast, there are of course more efficient
methods: computing a cumsum, shifting it by 2*wlen+1, and subtracting it
from itself will give you movsum, from which you can derive mean, var,
and std. This has complexity O(N) for K < N, compared to O(N*K) for the
approach above. But there can be issues of overflow for integer types
and of lost precision for doubles. min, max, and median can only be done
in O(N*log(K)), e.g. by a heap-based approach, but that is of course
only possible in compiled code.
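
The cumsum trick can be sketched like this for "full" windows of width K
(boundary handling omitted; names are illustrative):

+verbatim+
x  = (1:10).';
K  = 5;                              # window width, i.e. 2*wlen+1 above
cs = cumsum ([0; x]);                # cs(i+1) = sum (x(1:i))
ysum  = cs(K+1:end) - cs(1:end-K);   # ysum(i) = sum (x(i:i+K-1)); O(N) total
ymean = ysum / K;                    # moving mean over the N-K+1 full windows
-verbatim-

A moving variance can be built the same way from cumsum (x) and
cumsum (x.^2), with the caveat about cancellation error mentioned above.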

Michael