Faster imread and imwrite

Faster imread and imwrite

Mihai Babiac
Hello everyone,

First of all thanks for making this great app!

Recently I've been working on a project that involves generating a lot of images, processing them externally and then computing some statistics from the resulting images in Octave. The images are 4096x4096 16-bit uncompressed TIFFs, so around 100 MB each. I noticed that a big chunk of the total time is spent just reading and writing the files, so I started looking into how I could optimize that. Here are my findings.

Currently it's faster to convert the TIFFs to raw RGB images and read those manually with fread (~1.7x faster for a 16-bit image and 3x faster for an 8-bit image for me):
system(["gm convert " fname " " fname ".rgb"]);
fid = fopen([fname ".rgb"]);
im_fast = permute(reshape(fread(fid, [3 Inf], "*uint16", "ieee-be")', 4096, 4096, 3), [2 1 3]);
fclose(fid);
delete([fname ".rgb"]);

Thinking that hack was a bit silly, I decided to look at the source code of Octave itself. Some profiling showed that a big part of the read time is spent adjusting the value range from the GraphicsMagick internal representation to the actual bit depth. This is currently done by first converting everything to double, dividing, rounding and converting back to unsigned int (I'm mostly looking at the "TrueColorType" part of read_images). In almost all cases (except for 32-bit file depth, where for some reason Octave wants to return a normalized double result) these operations can be replaced by a bit shift.

With a simple bit shift, performance improves by 3.4x for 16-bit and 3x for 8-bit images. With an if-branch for each shift amount (0, 8 or 24), so that the shift value is a constant inside the per-pixel for-loops, it is even faster, around 5x for both 16-bit and 8-bit. I wonder how/if this code could be written in a nice and concise way, so as not to have three copies of the same code. I also don't know how the 32-bit case should be handled, but it really looks like a corner case to me: I guess most users don't have GraphicsMagick compiled with quantum-depth 32, and floating-point images are quite rare. Having users do the conversion from uint32 to float on their own might not be such a bad idea, but breaking existing code sounds bad.
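To illustrate what I mean, here's a minimal, self-contained sketch (not Octave's actual code; the function name is made up) of the shift-based conversion for the 16-bit-quantum, 8-bit-image case:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical sketch: rescale GraphicsMagick's 16-bit quantums to an
// 8-bit image with a constant shift instead of a double divide + round.
// This is exact for quantums that came from 8-bit data (q = v * 257),
// since (v * 257) >> 8 == v + (v >> 8) == v for all v <= 255.
std::vector<std::uint8_t> quantum16_to_uint8 (const std::vector<std::uint16_t>& q)
{
  std::vector<std::uint8_t> out (q.size ());
  for (std::size_t i = 0; i < q.size (); i++)
    out[i] = static_cast<std::uint8_t> (q[i] >> 8);  // constant shift, no FP math
  return out;
}
```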

In the case of imwrite, things are much nicer. Here I'm mostly looking at the "TrueColorType" part of encode_uint_image. The big time-waster is the construction and destruction of a Magick::Color object for each and every pixel, which internally calls new and delete. It turns out that the output values can be written directly to the output vector, without an intermediary Color object; that alone improves performance by more than 2.5x. It gets even faster if integer operations are used (4x with multiplications), but there are two cases: when the Octave variables are narrower than the quantum depth and when they are wider. In the first case we need to multiply by a constant that depends on the width of the template type and the quantum depth; in the second we need to shift by a value that depends on those depths, just as for imread. Unfortunately I don't have a lot of experience with templates, so I don't know how this can be done without duplicating code.
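One way to avoid the duplication could be a single template with C++17's "if constexpr", which picks the multiply-up or shift-down branch at compile time. A sketch (all names hypothetical, not the actual Octave code):

```cpp
#include <cstdint>

// Hypothetical sketch: scale an Octave pixel value to a GraphicsMagick
// quantum without constructing a Magick::Color.  `if constexpr` selects
// the widening or narrowing case from the type widths, so the logic is
// written only once.
template <typename OctaveT, typename QuantumT>
QuantumT to_quantum (OctaveT v)
{
  constexpr int ob = 8 * sizeof (OctaveT);   // Octave variable depth
  constexpr int qb = 8 * sizeof (QuantumT);  // quantum depth
  if constexpr (ob <= qb)
    {
      // Widen: multiply by (2^qb - 1) / (2^ob - 1), e.g. 257 for 8 -> 16.
      constexpr std::uint64_t factor =
        ((std::uint64_t (1) << qb) - 1) / ((std::uint64_t (1) << ob) - 1);
      return static_cast<QuantumT> (std::uint64_t (v) * factor);
    }
  else
    {
      // Narrow: shift away the extra bits, as in the imread case.
      return static_cast<QuantumT> (v >> (ob - qb));
    }
}
```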

Just a final note: I stored and read the images to and from a tmpfs filesystem, so the speedups might be a lot smaller on an HDD. In case anyone wonders (I know I did) whether the division and rounding can really be done by bit shifting alone, here's a proof for one of the cases:
 v = uint8(0:2^8-1);
 v_mag = uint16(v)*uint16((2^16-1)/(2^8-1));
 v_rec = bitshift(v_mag, -8);
 isequal(v_rec, v)

What do you think about all this? Is it worth the effort? If so, I could try cleaning my code up, extending it to all the cases (RGB, grayscale, with/without alpha, etc.) and sending you a changeset. I'm not exactly sure what the process is for getting involved.


All the best,
Mihai B



Re: Faster imread and imwrite

Daniel Sebald
On 12/10/2017 12:35 PM, Mihai Babiac wrote:

> [...]

Hi Mihai,

I wasn't aware that a division was done as part of the main loop in a
lot of these image load/save routines.  I'm looking at the code right
now and I'll give my thoughts.

First, efficiency is always welcome, but keep in mind that these patches
often get lost in the ether because bug fixes tend to take higher
priority, so be aware of that before diving in.  For now, just try a few
things and think it over.  (Octave has a patch tracker which can be used
to keep track of things.)

Any routine that processes large data should have as efficient a central
loop as possible, and the code does seem to shape the row/column loops
with that philosophy.  But yes, division is the worst thing to do: if it
is software-emulated division it's very slow, and even co-processor
division, while faster, still carries some overhead for loading the
co-processor and may take more machine cycles than, say, a simple
multiplication or a bitshift/load.

I used the word "load" because sometimes processors can accomplish a
16-bit shift simply by loading a high versus low register.  That kind of
thing.

Given that, I'd say yes, there is room for optimization, and it would
entail a bit of analysis at the front of the routine to see whether
there are simpler means of accomplishing the division.  For example,
analyze the division and ask "Is this division by a power of 2, i.e., a
bit shift?" or "Does this division boil down to an 8-bit or 16-bit shift,
and if so can it be written as some type of integer load that you know
will translate to just one or two instructions via an optimizing C
compiler?"  That sort of thing.  So the code might have a subset of
routines or macros that are really efficient bit shifts and register
loads, each called as appropriate for the analyzed scale factor.
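That analyze-then-dispatch idea could look something like this sketch (hypothetical names, not the actual Octave code): branch on the scale factor once per image, so each inner loop has a compile-time-constant shift the compiler can turn into a single instruction.

```cpp
#include <cstdint>
#include <cstddef>

// Hypothetical sketch: specialized inner loops with constant shifts,
// selected once per image rather than once per pixel.
static void scale_shift8 (const std::uint16_t* in, std::uint8_t* out,
                          std::size_t n)
{
  for (std::size_t i = 0; i < n; i++)
    out[i] = static_cast<std::uint8_t> (in[i] >> 8);  // constant shift
}

static void scale_truncate (const std::uint16_t* in, std::uint8_t* out,
                            std::size_t n)
{
  for (std::size_t i = 0; i < n; i++)
    out[i] = static_cast<std::uint8_t> (in[i]);  // depths already match
}

void quantum_to_image (const std::uint16_t* in, std::uint8_t* out,
                       std::size_t n, int shift)
{
  if (shift == 8)  // one branch per image, not per pixel
    scale_shift8 (in, out, n);
  else
    scale_truncate (in, out, n);
}
```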

There is another benefit to what I described.  A floating-point division
can introduce some inaccuracy if not done correctly, whereas bit shifts
and register loads are sure to be accurate in terms of image data bits.
That is, if we save image data with imwrite() we'd like to be assured
that imread() will read that data back with bit-wise accuracy, i.e., the
same exact data, not something whose LSB was altered due to
floating-point math.  (There should be a set of tests for write/read
accuracy.)
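The round-trip property is easy to check for the integer path. A tiny self-contained sketch (hypothetical helper, assuming the x257 write scaling and >>8 read scaling discussed above):

```cpp
#include <cstdint>

// Hypothetical sketch: verify the integer scale-up / scale-down pair is
// bit-exact, i.e. every 8-bit value survives a write (* 257) followed by
// a read (>> 8).
bool roundtrip_exact ()
{
  for (int v = 0; v <= 255; v++)
    {
      std::uint16_t q = static_cast<std::uint16_t> (v * 257);   // write path
      std::uint8_t back = static_cast<std::uint8_t> (q >> 8);   // read path
      if (back != v)
        return false;
    }
  return true;
}
```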

There is one very simple thing you could try here, in terms of
efficiency, and evaluate against the other numbers you are telling us.
Rather than have that division in the main loop, change it to something
like (pseudocode):

loop_multiplier = 1.0 / divisor;  /* hoist the division out of the loop */
for (column loop)
   for (row loop)
     img_fvec[idx++] = pix->red * loop_multiplier;
   end
end

Multiplies are often faster than divisions.  What types of speeds might
you see for that kind of construct?  If that improves efficiency in and
of itself, you could make a very simple patch first and then leave
greater improvement for later.

Dan