incremental read of gzipped matrix


Andreas Weber-6
Dear all,

is there currently a function in GNU Octave core or in a Forge package that
opens a gzipped file, reads as many rows as are available, and returns them
as a matrix? I know there is 'm = load ("foo.gz")', but this always reads the
whole file.

I'm able to flush the gzip stream from the writing process (for example with
Z_PARTIAL_FLUSH) so that the gzip decompressor can decode what has been
written so far.
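To illustrate the flushing idea, here is a minimal sketch in Python (not Octave, and not taken from any existing Octave function). It assumes the writer flushes with Z_SYNC_FLUSH rather than Z_PARTIAL_FLUSH; the helper names `gzip_chunks` and `decode_available` are made up for this example.

```python
import zlib

def gzip_chunks(row_batches):
    """Writer side: yield one flushed gzip chunk per batch of text rows."""
    # wbits=31 (third positional argument) selects the gzip container.
    comp = zlib.compressobj(6, zlib.DEFLATED, 31)
    for batch in row_batches:
        data = "".join(batch).encode("ascii")
        # Z_SYNC_FLUSH emits all pending output, so a reader can decode
        # everything that has been written up to this point.
        yield comp.compress(data) + comp.flush(zlib.Z_SYNC_FLUSH)

def decode_available(decomp, chunk):
    """Reader side: decode whatever the flushed chunk contains."""
    return decomp.decompress(chunk).decode("ascii")

# The stream is never finished, yet every flushed chunk is decodable.
reader = zlib.decompressobj(31)
decoded = [decode_available(reader, c)
           for c in gzip_chunks([["1,2\n", "3,4\n"], ["5,6\n"]])]
```

Each element of `decoded` holds exactly the rows of the corresponding batch, even though the gzip stream has not been closed.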

Thanks, Andy


Re: incremental read of gzipped matrix

Andreas Weber-6
On 07.12.19 at 05:44, Andreas Weber wrote:
> is there currently a function in GNU Octave core or in a Forge package that
> opens a gzipped file, reads as many rows as are available, and returns them
> as a matrix?

I forgot the important part: keep the file open and store the position so
that the newly appended part can be read later:

pseudo code:
fid = gzopen ("foo.gz", "r");
m = fget_matrix (fid); # returns a 50x5 matrix
.... sleep....
# meanwhile another process appends data to the .gz file
m = fget_matrix (fid); # now returns a 20x5 matrix with the new data
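The polling pattern in the pseudo code can be sketched in Python as follows. This is only an illustration under assumptions: `IncrementalGzReader`, `append_rows` and `demo` are hypothetical names, the writer is assumed to flush with Z_SYNC_FLUSH after every batch, and rows are assumed to be comma-separated numbers.

```python
import os
import tempfile
import zlib

class IncrementalGzReader:
    """Hypothetical stand-in for the gzopen/fget_matrix pseudo code."""

    def __init__(self, path):
        self._f = open(path, "rb")             # stays open between polls
        self._decomp = zlib.decompressobj(31)  # 31 selects the gzip container
        self._tail = ""                        # incomplete last row, if any

    def read_new_rows(self):
        """Return the rows (lists of floats) appended since the last call."""
        raw = self._f.read()  # only the bytes after the current file position
        text = self._tail + self._decomp.decompress(raw).decode("ascii")
        lines = text.split("\n")
        self._tail = lines.pop()  # keep a partial row for the next poll
        return [[float(x) for x in ln.split(",")] for ln in lines if ln]

    def close(self):
        self._f.close()

def append_rows(path, comp, rows):
    """Writer side: append rows and flush so that a reader can decode them."""
    data = "".join(rows).encode("ascii")
    with open(path, "ab") as f:
        f.write(comp.compress(data) + comp.flush(zlib.Z_SYNC_FLUSH))

def demo():
    comp = zlib.compressobj(6, zlib.DEFLATED, 31)
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "foo.gz")
        append_rows(path, comp, ["1,2\n", "3,4\n"])
        reader = IncrementalGzReader(path)
        first = reader.read_new_rows()    # rows written so far
        append_rows(path, comp, ["5,6\n"])
        second = reader.read_new_rows()   # only the newly appended row
        reader.close()
    return first, second
```

The decompressor object and the open file handle together carry the "position": the file offset lives in the handle, and the state of the gzip stream lives in the decompressor, so a later poll continues where the previous one stopped.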

-- Andy


Re: incremental read of gzipped matrix

Andreas Weber-6
On 07.12.19 at 06:20, Andreas Weber wrote:

> On 07.12.19 at 05:44, Andreas Weber wrote:
>> is there currently a function in GNU Octave core or in a Forge package that
>> opens a gzipped file, reads as many rows as are available, and returns them
>> as a matrix?
>
> I forgot the important part: keep the file open and store the position so
> that the newly appended part can be read later:
>
> pseudo code:
> fid = gzopen ("foo.gz", "r");
> m = fget_matrix (fid); # returns a 50x5 matrix
> .... sleep....
> # meanwhile another process appends data to the .gz file
> m = fget_matrix (fid); # now returns a 20x5 matrix with the new data


In the meantime I've started my own implementation:
  https://github.com/Andy1978/load_gz/tree/master

So far I get a factor of 10 improvement in runtime when reading large
gzipped numeric CSVs:

For example, for a matrix created with rand (1e6, 8):
Octave 5.1.1: Elapsed time is 16.9351 seconds.
load_gz.oct   Elapsed time is 1.49143 seconds.

If someone wants to try it:
  git clone https://github.com/Andy1978/load_gz.git
  cd load_gz
  make check

The incremental part is still missing.

-- Andy


Re: incremental read of gzipped matrix

mmuetzel
On 8 December 2019 at 18:55, "Andreas Weber" wrote:

> In the meantime I've started my own implementation:
>   https://github.com/Andy1978/load_gz/tree/master
>
> So far I get a factor of 10 improvement in runtime when reading large
> gzipped numeric CSVs:
>
> For example, for a matrix created with rand (1e6, 8):
> Octave 5.1.1: Elapsed time is 16.9351 seconds.
> load_gz.oct   Elapsed time is 1.49143 seconds.
>
> If someone wants to try it:
>   git clone https://github.com/Andy1978/load_gz.git
>   cd load_gz
>   make check
>
> The incremental part is still missing.

Octave has the gzifstream class (in zfstream.h), which might be useful for your purpose.
There is no direct front end for it in the Octave scripting language, though.

Markus



Re: incremental read of gzipped matrix

Andreas Weber-6
Hey Markus,

On 08.12.19 at 19:25, "Markus Mützel" wrote:
> Octave has the gzifstream class (in zfstream.h), which might be useful for your purpose.

I've used this before and only got an improvement by a factor of 2.
That factor of 2 arises because load currently parses the input file twice:
the first pass detects the number of columns and rows, then the Matrix is
allocated, and then the file is parsed a second time.
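The difference between the two strategies can be sketched in Python. This is only an illustration of the idea, neither Octave's actual load code nor the load_gz implementation; both function names are hypothetical and the input is assumed to be comma-separated numeric text lines.

```python
def parse_two_pass(lines):
    """Pass 1 detects the shape, then the matrix is allocated and filled."""
    rows = [ln for ln in lines if ln.strip()]         # pass 1: count rows
    n_cols = len(rows[0].split(",")) if rows else 0   # pass 1: count columns
    matrix = [[0.0] * n_cols for _ in rows]           # allocate up front
    for i, ln in enumerate(rows):                     # pass 2: parse the values
        for j, field in enumerate(ln.split(",")):
            matrix[i][j] = float(field)
    return matrix

def parse_single_pass(lines):
    """Parse every line once into a growing buffer, shape it at the end."""
    flat = []
    n_cols = None
    for ln in lines:
        if not ln.strip():
            continue
        fields = ln.split(",")
        if n_cols is None:
            n_cols = len(fields)          # column count taken from first row
        elif len(fields) != n_cols:
            raise ValueError("inconsistent number of columns")
        flat.extend(float(f) for f in fields)
    if n_cols is None:
        return []
    return [flat[i:i + n_cols] for i in range(0, len(flat), n_cols)]
```

The single-pass variant trades the up-front allocation for a buffer that grows while reading, which also fits an input whose final length is not known yet, such as a .gz file that is still being written.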

In the meantime I've enabled incremental reads in load_gz, which is very
important for me because I periodically poll a growing .gz file.

-- Andy


Re: incremental read of gzipped matrix

mmuetzel
On 8 December 2019 at 22:30, "Andreas Weber" wrote:
> On 08.12.19 at 19:25, "Markus Mützel" wrote:
> > Octave has the gzifstream class (in zfstream.h), which might be useful for your purpose.
>
> I've used this before and only got an improvement by a factor of 2.
> That factor of 2 arises because load currently parses the input file twice:
> the first pass detects the number of columns and rows, then the Matrix is
> allocated, and then the file is parsed a second time.

Do you know why gzifstream is so much slower than your code? Given that you measured a 10-fold performance increase, there might be room for another factor of 5 (taking the double parsing into account).
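As a quick check of that estimate against the timings posted earlier in the thread, here is the arithmetic in Python. The only assumption is that the double parsing accounts for exactly a factor of 2.

```python
load_seconds = 16.9351       # Octave 5.1.1, from the timing quoted in the thread
load_gz_seconds = 1.49143    # load_gz.oct, from the same timing

speedup = load_seconds / load_gz_seconds    # overall measured factor
double_parse_factor = 2.0                   # assumed share of the double parsing
remaining = speedup / double_parse_factor   # factor not explained by it
```

With the posted numbers the overall speedup is a bit above 11, which leaves a factor between 5 and 6 that the double parsing does not explain.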

> In the meantime I've enabled incremental reads in load_gz, which is very
> important for me because I periodically poll a growing .gz file.
>
> -- Andy
>