FFT slowup in 2.1.55

classic Classic list List threaded Threaded
6 messages Options
Reply | Threaded
Open this post in threaded view
|

FFT slowup in 2.1.55

David Bateman-3
Ok, due to the alignment issues in FFTW 3.x I've found I can get
into the case where I'm calling exactly the same FFT, but the
alignment changes each time I call the function. The planner in
oct-fftw.cc then recreates the plan each time, which is slow.
For small FFT's [I've tested fft(randn(64,45))], this results in
a significant slowup.

I've tried two lines of attack to fix this problem. The first is to
try and force the 16-byte alignment in "class Array<T>" using

#ifdef HAVE_ATTRIB_ALIGN
    typedef T alignedT __attribute__ ((aligned(16)));
#else
    typedef T alignedT;
#endif

where HAVE_ATTRIB_ALIGN is a configure time option. I then replace the
"new T" in ArrayRep with "new alignedT". This appears to only give me
8-byte alignment, but I've read in the gcc manual that ld might not be
able to do better than 8-byte alignment, so it is not clear to me if

1) The code I've done is correct, but the linker won't give me 16-byte
alignment, or
2) I've missing somewhere which also needs 16-byte alignment.

Does anyone have any ideas on this?

Secondly, I force the use of FFTW_UNALIGNED when the data is not
aligned on a 16 byte boundary. I can then reuse this plan with any
unaligned data. Also, to prevent the continual switching between an
aligned and unaligned plan, if I have a good unaligned good, I don't
recreate a plan even if the data is now aligned.

With this change I've managed to fix the symptoms of the problem, and
have regained the speed of FFTW3. However, I'd really like to get the
16-byte alignment working, so any clues here would be appreciated.

I attach a patch, that can probably be applied as is...

Regards
David


--
David Bateman                                [hidden email]
Motorola CRM                                 +33 1 69 35 48 04 (Ph)
Parc Les Algorithmes, Commune de St Aubin    +33 1 69 35 77 01 (Fax)
91193 Gif-Sur-Yvette FRANCE

The information contained in this communication has been classified as:

[x] General Business Information
[ ] Motorola Internal Use Only
[ ] Motorola Confidential Proprietary
Reply | Threaded
Open this post in threaded view
|

Re: FFT slowup in 2.1.55

David Bateman-3
Sorry to answer my own mail, but the previous patch had a stupid little bug...

D.

--
David Bateman                                [hidden email]
Motorola CRM                                 +33 1 69 35 48 04 (Ph)
Parc Les Algorithmes, Commune de St Aubin    +33 1 69 35 77 01 (Fax)
91193 Gif-Sur-Yvette FRANCE

The information contained in this communication has been classified as:

[x] General Business Information
[ ] Motorola Internal Use Only
[ ] Motorola Confidential Proprietary
Reply | Threaded
Open this post in threaded view
|

FFT slowup in 2.1.55

John W. Eaton-6
In reply to this post by David Bateman-3
On 27-Feb-2004, David Bateman <[hidden email]> wrote:

| Ok, due to the alignment issues in FFTW 3.x I've found I can get
| into the case where I'm calling exactly the same FFT, but the
| alignment changes each time I call the function. The planner in
| oct-fftw.cc then recreates the plan each time, which is slow.
| For small FFT's [I've tested fft(randn(64,45))], this results in
| a significant slowup.
|
| I've tried two lines of attack to fix this problem. The first is to
| try and force the 16-byte alignment in "class Array<T>" using
|
| #ifdef HAVE_ATTRIB_ALIGN
|     typedef T alignedT __attribute__ ((aligned(16)));
| #else
|     typedef T alignedT;
| #endif
|
| where HAVE_ATTRIB_ALIGN is a configure time option. I then replace the
| "new T" in ArrayRep with "new alignedT". This appears to only give me
| 8-byte alignment, but I've read in the gcc manual that ld might not be
| able to do better than 8-byte alignment, so it is not clear to me if
|
| 1) The code I've done is correct, but the linker won't give me 16-byte
| alignment, or
| 2) I've missing somewhere which also needs 16-byte alignment.
|
| Does anyone have any ideas on this?

Doesn't the __attribute__ ((aligned(16)) qualifier apply to variables,
not types?  It seems to me that attaching it to a typedef would have
no effect.

In any case, the variable you want to align in the Array classes
doesn't actually exist in the class.  We only have a pointer to it,
and you don't really want the pointer to be aligned on a 16-bit
boundary, you want the object that it points to to be aligned.  So I
think you have to write an allocator that will do that for you.  So
instead of writing

  explicit ArrayRep (int n) :
    data (new alignedT [n]), len (n), count (1) { }

I think it might work to write

  explicit ArrayRep (int n) :
    data (make_aligned_16_double_array (n)), len (n), count (1) { }

where "make_aligned_16_double_array" is a function that returns a
pointer to double which has the elements of the allocated array
aligned on a 16-byte boundary.  To do that, you will probably need to
allocate a char array of the appropriate size (might require up to 15
bytes of padding) then return a pointer to one of the first 15
elements with a cast to make it a pointer to double instead of a
pointer to char.

But the solution above won't quite work because ArrayRep is a template
class.  So we need to do this only for some values of the template
type paramater T (double and maybe Complex).  Some method of
specializing these functions is needed.  Here is a simplified example
that I think should work (I leave it up to you to determine how to
compute the necessary offset so that the data pointed to by the pointer
returned from the make_aligned_16_double_array function is actually on
a 16-byte boundary).

#include <iostream>

template <class T>
class
Array
{
public:

  class ArrayRep
  {
  public:

    T *data;
    int len;

    explicit ArrayRep (int n)
      : data (new T [n]), len (n) { std::cerr << "generic" << std::endl; }
  };

  ArrayRep *rep;

  explicit Array (int n)
    : rep (new typename Array<T>::ArrayRep (n)) { }
};

double *
make_aligned_16_double_array (int n)
{
  // Do something magic to find the required offset (should be in the
  // range [0, 15].
  int offset = 0;

  char *buf = new char [n * sizeof (double) + offset];

  return reinterpret_cast<double *> (&buf[offset]);
}

template <>
Array<double>::ArrayRep::ArrayRep (int n)
  : data (make_aligned_16_double_array (n)), len (n)
{
  std::cerr << "double" << std::endl;
}

int
main (void)
{
  Array<int> int_ra (2);
  Array<double> double_ra (2);
}

Or am I missing a simpler solution?

jwe


Reply | Threaded
Open this post in threaded view
|

Re: FFT slowup in 2.1.55

David Bateman-3
According to John W. Eaton <[hidden email]> (on 02/27/04):
> Doesn't the __attribute__ ((aligned(16)) qualifier apply to variables,
> not types?  It seems to me that attaching it to a typedef would have
> no effect.

Damn, you're right. I'd thought with the typedef I was getting
around this, but I'm obviously not. Still the patch to oct-fftw.cc
is still valid and results in a speedup.

> In any case, the variable you want to align in the Array classes
> doesn't actually exist in the class.  We only have a pointer to it,
> and you don't really want the pointer to be aligned on a 16-bit
> boundary, you want the object that it points to to be aligned.  So I
> think you have to write an allocator that will do that for you.  So
> instead of writing
>
>   explicit ArrayRep (int n) :
>     data (new alignedT [n]), len (n), count (1) { }
>
> I think it might work to write
>
>   explicit ArrayRep (int n) :
>     data (make_aligned_16_double_array (n)), len (n), count (1) { }
>
> where "make_aligned_16_double_array" is a function that returns a
> pointer to double which has the elements of the allocated array
> aligned on a 16-byte boundary.  To do that, you will probably need to
> allocate a char array of the appropriate size (might require up to 15
> bytes of padding) then return a pointer to one of the first 15
> elements with a cast to make it a pointer to double instead of a
> pointer to char.
>
> But the solution above won't quite work because ArrayRep is a template
> class.  So we need to do this only for some values of the template
> type paramater T (double and maybe Complex).  Some method of
> specializing these functions is needed.  Here is a simplified example
> that I think should work (I leave it up to you to determine how to
> compute the necessary offset so that the data pointed to by the pointer
> returned from the make_aligned_16_double_array function is actually on
> a 16-byte boundary).
>
> #include <iostream>
>
> template <class T>
> class
> Array
> {
> public:
>
>   class ArrayRep
>   {
>   public:
>
>     T *data;
>     int len;
>
>     explicit ArrayRep (int n)
>       : data (new T [n]), len (n) { std::cerr << "generic" << std::endl; }
>   };
>
>   ArrayRep *rep;
>
>   explicit Array (int n)
>     : rep (new typename Array<T>::ArrayRep (n)) { }
> };
>
> double *
> make_aligned_16_double_array (int n)
> {
>   // Do something magic to find the required offset (should be in the
>   // range [0, 15].
>   int offset = 0;
>
>   char *buf = new char [n * sizeof (double) + offset];
>
>   return reinterpret_cast<double *> (&buf[offset]);
> }
>
> template <>
> Array<double>::ArrayRep::ArrayRep (int n)
>   : data (make_aligned_16_double_array (n)), len (n)
> {
>   std::cerr << "double" << std::endl;
> }
>
> int
> main (void)
> {
>   Array<int> int_ra (2);
>   Array<double> double_ra (2);
> }
>
> Or am I missing a simpler solution?

I'd hoped not to have to do this since its ugly. It seems to me that
there should be some form of compiler magic that should be capable of
ensuring the alignment, but I'm damned if I can figure it out.

Ok, I'll look at implementing the above. One advantage, even with the
ugliness, is the fact that it doesn't rely on the compiler and so there
are no needs for alignment checking at all in oct-fftw.cc

Cheers
David

--
David Bateman                                [hidden email]
Motorola CRM                                 +33 1 69 35 48 04 (Ph)
Parc Les Algorithmes, Commune de St Aubin    +33 1 69 35 77 01 (Fax)
91193 Gif-Sur-Yvette FRANCE

The information contained in this communication has been classified as:

[x] General Business Information
[ ] Motorola Internal Use Only
[ ] Motorola Confidential Proprietary


Reply | Threaded
Open this post in threaded view
|

Re: FFT slowup in 2.1.55

David Bateman-3
In reply to this post by John W. Eaton-6
Well, I don't know when I'll get a real chance to look at this issue,
and it might not be before the next release. So could you apply the
attached patch that at least gets rid of the thrashing back and forth
between aligned and unaligned FFTW plans.

If I do something about the alignment before the next, this code should
probably stay in for a while anyway, as a sanity check on the alignment
code....

Regards
David

According to John W. Eaton <[hidden email]> (on 02/27/04):

> On 27-Feb-2004, David Bateman <[hidden email]> wrote:
>
> | Ok, due to the alignment issues in FFTW 3.x I've found I can get
> | into the case where I'm calling exactly the same FFT, but the
> | alignment changes each time I call the function. The planner in
> | oct-fftw.cc then recreates the plan each time, which is slow.
> | For small FFT's [I've tested fft(randn(64,45))], this results in
> | a significant slowup.
> |
> | I've tried two lines of attack to fix this problem. The first is to
> | try and force the 16-byte alignment in "class Array<T>" using
> |
> | #ifdef HAVE_ATTRIB_ALIGN
> |     typedef T alignedT __attribute__ ((aligned(16)));
> | #else
> |     typedef T alignedT;
> | #endif
> |
> | where HAVE_ATTRIB_ALIGN is a configure time option. I then replace the
> | "new T" in ArrayRep with "new alignedT". This appears to only give me
> | 8-byte alignment, but I've read in the gcc manual that ld might not be
> | able to do better than 8-byte alignment, so it is not clear to me if
> |
> | 1) The code I've done is correct, but the linker won't give me 16-byte
> | alignment, or
> | 2) I've missing somewhere which also needs 16-byte alignment.
> |
> | Does anyone have any ideas on this?
>
> Doesn't the __attribute__ ((aligned(16)) qualifier apply to variables,
> not types?  It seems to me that attaching it to a typedef would have
> no effect.
>
> In any case, the variable you want to align in the Array classes
> doesn't actually exist in the class.  We only have a pointer to it,
> and you don't really want the pointer to be aligned on a 16-bit
> boundary, you want the object that it points to to be aligned.  So I
> think you have to write an allocator that will do that for you.  So
> instead of writing
>
>   explicit ArrayRep (int n) :
>     data (new alignedT [n]), len (n), count (1) { }
>
> I think it might work to write
>
>   explicit ArrayRep (int n) :
>     data (make_aligned_16_double_array (n)), len (n), count (1) { }
>
> where "make_aligned_16_double_array" is a function that returns a
> pointer to double which has the elements of the allocated array
> aligned on a 16-byte boundary.  To do that, you will probably need to
> allocate a char array of the appropriate size (might require up to 15
> bytes of padding) then return a pointer to one of the first 15
> elements with a cast to make it a pointer to double instead of a
> pointer to char.
>
> But the solution above won't quite work because ArrayRep is a template
> class.  So we need to do this only for some values of the template
> type paramater T (double and maybe Complex).  Some method of
> specializing these functions is needed.  Here is a simplified example
> that I think should work (I leave it up to you to determine how to
> compute the necessary offset so that the data pointed to by the pointer
> returned from the make_aligned_16_double_array function is actually on
> a 16-byte boundary).
>
> #include <iostream>
>
> template <class T>
> class
> Array
> {
> public:
>
>   class ArrayRep
>   {
>   public:
>
>     T *data;
>     int len;
>
>     explicit ArrayRep (int n)
>       : data (new T [n]), len (n) { std::cerr << "generic" << std::endl; }
>   };
>
>   ArrayRep *rep;
>
>   explicit Array (int n)
>     : rep (new typename Array<T>::ArrayRep (n)) { }
> };
>
> double *
> make_aligned_16_double_array (int n)
> {
>   // Do something magic to find the required offset (should be in the
>   // range [0, 15].
>   int offset = 0;
>
>   char *buf = new char [n * sizeof (double) + offset];
>
>   return reinterpret_cast<double *> (&buf[offset]);
> }
>
> template <>
> Array<double>::ArrayRep::ArrayRep (int n)
>   : data (make_aligned_16_double_array (n)), len (n)
> {
>   std::cerr << "double" << std::endl;
> }
>
> int
> main (void)
> {
>   Array<int> int_ra (2);
>   Array<double> double_ra (2);
> }
>
> Or am I missing a simpler solution?
>
> jwe
--
David Bateman                                [hidden email]
Motorola CRM                                 +33 1 69 35 48 04 (Ph)
Parc Les Algorithmes, Commune de St Aubin    +33 1 69 35 77 01 (Fax)
91193 Gif-Sur-Yvette FRANCE

The information contained in this communication has been classified as:

[x] General Business Information
[ ] Motorola Internal Use Only
[ ] Motorola Confidential Proprietary
Reply | Threaded
Open this post in threaded view
|

Re: FFT slowup in 2.1.55

John W. Eaton-6
On  2-Mar-2004, David Bateman <[hidden email]> wrote:

| Well, I don't know when I'll get a real chance to look at this issue,
| and it might not be before the next release. So could you apply the
| attached patch that at least gets rid of the thrashing back and forth
| between aligned and unaligned FFTW plans.
|
| If I do something about the alignment before the next, this code should
| probably stay in for a while anyway, as a sanity check on the alignment
| code....

I applied these changes.

Thanks,

jwe