Modifications to hist.m

classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

Modifications to hist.m

Andy Adler
I propose the following patch to hist.m;
it results in about 2.5x speedup.

octave:1> im=rand(500^2,1);
octave:2> jnk= hist(im,100); tic;v1=hist(im,100);toc
ans = 7.1460
octave:3> jnk= hist_orig(im,100);tic;v2=hist_orig(im,100);toc
ans = 17.604
octave:4> std(v1-v2)
ans = 0


$ diff -u hist.m /usr/local/oct2145/share/octave/2.1.45/m/plot/hist.m
--- hist.m      2003-03-15 15:47:16.000000000 -0500
+++ /usr/local/oct2145/share/octave/2.1.45/m/plot/hist.m        2003-02-27
19:22
:32.000000000 -0500
@@ -87,11 +87,12 @@
     endif
   endif

-  cumfreq = [zeros (1, n), length(y) ];
-  for i = 1:n-1
-    cumfreq (i+1) = sum (y < cutoff (i));
-  end
-  freq= diff(cumfreq);
+  freq = zeros (1, n);
+  freq (1) = sum (y < cutoff (1));
+  for i = 2:n-1
+    freq (i) = sum (y >= cutoff (i-1) & y < cutoff (i));
+  endfor
+  freq (n) = sum (y >= cutoff (n-1));

   if (nargin == 3)
     ## Normalise the histogram.


Andy


Reply | Threaded
Open this post in threaded view
|

Re: Modifications to hist.m

Paul Kienzle
Andy Adler wrote:

>I propose the following patch to hist.m;
>it results in about 2.5x speedup.
>
At one point a rewrote hist to not have any
loops.  I was doing a whole lot of really large
histograms (e.g., 100000 values drawn from
a poisson distribution --- I had to rewrite the
poisson generator too ;-)

Please check it out and tell me if it is any faster
than what you've got.  I think it might be slower
if you have a histogram with a lot of empty bins.

Thanks,

Paul Kienzle
[hidden email]


## Copyright (C) 1996, 1997 John W. Eaton
##
## This file is part of Octave.
##
## Octave is free software; you can redistribute it and/or modify it
## under the terms of the GNU General Public License as published by
## the Free Software Foundation; either version 2, or (at your option)
## any later version.
##
## Octave is distributed in the hope that it will be useful, but
## WITHOUT ANY WARRANTY; without even the implied warranty of
## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
## General Public License for more details.
##
## You should have received a copy of the GNU General Public License
## along with Octave; see the file COPYING.  If not, write to the Free
## Software Foundation, 59 Temple Place - Suite 330, Boston, MA
## 02111-1307, USA.

## -*- texinfo -*-
## @deftypefn {Function File} {} hist (@var{y}, @var{x}, @var{norm})
## Produce histogram counts or plots.
##
## With one vector input argument, plot a histogram of the values with
## 10 bins.  The range of the histogram bins is determined by the range
## of the data.
##
## Given a second scalar argument, use that as the number of bins.
##
## Given a second vector argument, use that as the centers of the bins,
## with the width of the bins determined from the adjacent values in
## the vector.
##
## If third argument is provided, the histogram is normalised such that
## the sum of the bars is equal to @var{norm}.
##
## Extreme values are lumped in the first and last bins.
##
## With two output arguments, produce the values @var{nn} and @var{xx} such
## that @code{bar (@var{xx}, @var{nn})} will plot the histogram.
## @end deftypefn
## @seealso{bar}

## Author: jwe

function [nn, xx] = hist (y, x, norm)

  if (nargin < 1 || nargin > 3)
    usage ("[nn, xx] = hist (y, x, norm)");
  endif

  if (is_vector (y))
    max_val = max (y);
    min_val = min (y);
  else
    error ("hist: first argument must be a vector");
  endif

  if (nargin == 1)
    x = 10;
  endif

  if (is_scalar (x))
    n = x;
    if (n <= 0 || n != fix(n))
      error ("hist: number of bins must be a positive integer");
    endif
    delta = (max_val - min_val) / n / 2;
    x = linspace (min_val+delta, max_val-delta, n);

    freq = zeros(1,n);
    q = sort(y(:).');
    L = length(q);
    if (q(1) == q(L))
      freq(n) = L;
    else
      q = (q - q(1))/(q(L)-q(1))/(1+eps); # set y-range to [0,1)
      q = fix(q*n);                  # split into n bins
      same = ( q == [q(2:L),-Inf] ); # true if neighbours are in the same bin
      q = q(~same);                  # q lists the 'active' bins (0-origin)
     
      f = cumsum(same);              # cumulative histogram
      f = f(~same);
      f = [f(1), diff(f)] + 1;       # cumulative histogram -> histogram
      ## we need to add 1 since we did not count the point at the
      ## boundary between bins (it was turned into zero)
     
      ## distribute f to the active bins, leaving the remaining bins empty
      freq(q+1) = f;
    endif
  elseif (is_vector (x))
    tmp = sort (x);
    if (any (tmp != x))
      warning ("hist: bin values not sorted on input");
      x = tmp;
    endif
    n = length(x);
    cutoff = ( x(1:n-1) + x(2:n) ) / 2;    # find bin boundaries
    [s, idx] = sort ( [cutoff(:); y(:)] ); # put elements between boundaries
    chist = cumsum(idx>n);                 # integrate over all elements
    chist = [chist(idx<n); chist(length(chist))];  # keep totals at boundaries
    freq = [chist(1); diff(chist) ];       # differentiate for histogram
  else
    error ("hist: second argument must be a scalar or a vector");
  endif

  if (nargin == 3)
    ## Normalise the histogram.
    freq = freq / length(y) * norm;
  endif

  if (nargout > 0)
    nn = freq;
    xx = x;
  else
    bar (x, freq);
  endif

endfunction
Reply | Threaded
Open this post in threaded view
|

Re: Modifications to hist.m

Andy Adler
Paul,

That's some really nice code - Tried to think of a faster
solution that would not eat too much more memory, but
yours is great. It's slightly more expensive for a small
number of bins, but the crossover is about 30, in my example

r=[]; i=0;
for sz= [100, 200, 500];
    x=rand(sz^2,1);
    for n=[10,30,100];  i++;
      tic; h1=hist1(x,n);r(i,1)= toc;
      tic; h2=hist2(x,n);r(i,2)= toc;
      tic; h3=hist3(x,n);r(i,3)= toc;
    end
end

          CURRENT   ANDY    PAUL
SZ=100
   n=10    0.059   0.032   0.077
   n=30    0.172   0.079   0.085
   n=100   0.575   0.245   0.086
SZ=200
   n=10    0.242   0.109   0.344
   n=30    0.754   0.312   0.335
   n=100   2.526   1.016   0.350
SZ=500
   n=10    1.688   0.733   2.831
   n=30    9.110   2.237   4.967
   n=100  32.847  13.291   5.431

Andy



Reply | Threaded
Open this post in threaded view
|

Re: Modifications to hist.m

Alois Schlögl-2
In reply to this post by Paul Kienzle




Paul Kienzle wrote:

> Andy Adler wrote:
>
>> I propose the following patch to hist.m;
>> it results in about 2.5x speedup.
>>
> At one point a rewrote hist to not have any
> loops.  I was doing a whole lot of really large
> histograms (e.g., 100000 values drawn from
> a poisson distribution --- I had to rewrite the
> poisson generator too ;-)
>
> Please check it out and tell me if it is any faster
> than what you've got.  I think it might be slower
> if you have a histogram with a lot of empty bins.
>
> Thanks,
>
> Paul Kienzle
> [hidden email]
>



The original HIST.M was also to slow for me. My solution was
implementing HISTO.M (see octave-forge/extra/tsa/ ) and the companion
functions histo2, histo3, histo4 (histo2, histo3, and histo4 differ only
if the input has multiple columns). All HISTO.M's use SORT.

However, HISTO has another difference: each sample value gets its own
bin. For this reason, HISTO.M is not 1-to-1 compatible to HIST.M. I
prefer the result from HISTO, because the bins do not depend on (max and
min of) the samples as in HIST.


Alois









Reply | Threaded
Open this post in threaded view
|

RE: Modifications to hist.m

Lippert, Ross A.
In reply to this post by Andy Adler

I was recently helping someone at work here out with a faster
hist.  This one is simpler, but perhaps it is broken in some way
that others here will not want to use it.

I did something like this:
  if (size(y,1) == 1)
     [ans,I] = sort([y cutoff -inf]);
  elseif (size(y,2) == 1)
     [ans,I] = sort([y' cutoff -inf]);
  else
     error("y must be a row or vector");
  endif

  freq = diff(find(I > length(y)))-1;
where cutoff is computed just as it was computed before and y is the given
data.  There are some +/- inf issues which may be problematic with the above
so perhaps one could instead do:
   [ans,I] = sort([y cutoff]);
   freq = diff( [0 find(I > length(y))]  ) - 1;
and get the same effect.

I figured that the extra log(n) of a C sort isn't really consequential.


-r

-----Original Message-----
From: Paul Kienzle [mailto:[hidden email]]
Sent: Saturday, March 15, 2003 9:43 PM
To: Andy Adler
Cc: [hidden email]
Subject: Re: Modifications to hist.m


Andy Adler wrote:

>I propose the following patch to hist.m;
>it results in about 2.5x speedup.
>
At one point a rewrote hist to not have any
loops.  I was doing a whole lot of really large
histograms (e.g., 100000 values drawn from
a poisson distribution --- I had to rewrite the
poisson generator too ;-)

Please check it out and tell me if it is any faster
than what you've got.  I think it might be slower
if you have a histogram with a lot of empty bins.

Thanks,

Paul Kienzle
[hidden email]