Import large field-delimited file with strings and numbers

Re: Import large field-delimited file with strings and numbers

Joao Rodrigues

On 08-09-2014 17:49, Philip Nienhuis wrote:

>> <snip>
>> Yet, csv2cell is orders of magnitude faster. I will break the big file
>> into chunks (using fileread, strfind to determine newlines and fprintf)
>> and then apply csv2cell chunk-wise.
> Why do you need to break it up using csv2cell? AFAICS that reads the entire
> file and directly translates the data into "values" in the output cell
> array, using very little temporary storage (the latter quite unlike
> textscan/strread).
> It does read the entire file twice, once to assess the required dimensions
> for the cell array, the second (more intensive) pass for actually reading
> the data.
The file I want to read has around 35 million rows, 15 columns and takes
200 MB of disk space: csv2cell simply ate up all memory and the computer
stopped responding.

I tried to feed it small chunks of increasing size and found out that it
behaved well until it received a chunk of 500 million rows (when memory
use went through the stratosphere).

So I opted for the clumsy solution of breaking the file into small
pieces and spoon-feeding csv2cell.
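
For reference, a rough sketch of that chunk-wise approach (the file name,
chunk size and temporary file are only placeholders; csv2cell comes from
the io package):

pkg load io
txt = fileread ("bigfile.csv");            % whole file as one char array
nl  = [0 strfind(txt, "\n")];              % newline positions (0 marks the start)
chunk_rows = 100000;                       % rows per chunk, an arbitrary choice
data = {};
for i = 1:chunk_rows:(numel (nl) - 1)
  last  = min (i + chunk_rows, numel (nl));
  piece = txt(nl(i)+1 : nl(last));         % text of this block of rows
  fid = fopen ("chunk.csv", "w");          % spoon-feed csv2cell via a temp file
  fprintf (fid, "%s", piece);
  fclose (fid);
  data = [data; csv2cell("chunk.csv")];    % parse just this chunk
end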

But then I found out something interesting. If I saved a cell with
35 million rows and only 3 columns in gzip format it took very little
disk space (20 MB or so), but when I tried to load it... it again took
forever and ate up GBs of memory.

Bottom line: I think it has to do with the way Octave allocates memory
to cells, which is not very efficient (as opposed to dense or sparse
numerical data, which it handles very well).
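
(For anyone who wants to see the effect on a small scale, a quick
experiment along these lines shows it; the sizes are only illustrative:)

x = rand (1e6, 3);     % plain double matrix, ~24 MB of data
c = num2cell (x);      % same numbers, one cell element per value
whos x c               % "whos" reports similar byte counts for both...
% ...but watching the octave process in top / Task Manager while c is built
% shows far more RAM in use: the per-element overhead of cells is not
% included in the "bytes" column.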

I managed to solve the problem I had, thanks to the help of you guys.

However, I think it would probably be nice if in future versions of
Octave there was something akin to ulimit installed by default to
prevent a process from eating up all available memory.

If someone wants to check this issue the data I am working with is public:

http://www.bls.gov/cew/data/files/*/csv/*_annual_singlefile.zip

where * = 1990:2013

http://www.bls.gov/cew/datatoc.htm explains the content.



Re: Import large field-delimited file with strings and numbers

Joao Rodrigues
In reply to this post by CdeMills

On 08-09-2014 18:53, CdeMills wrote:

> Hello João,
> this is the perfect job for the "dataframe" package. It reads a CSV file and
> returns an object which tries to mimic a matrix. You can select ranges,
> perform arithmetic operations, ...
>
> You'll find it in the octave-forge sources.
Hi Pascal, thanks for the tip.

I got this:

octave:1> pkg install -forge dataframe
error: could not download file dataframe-1.0.0.tar.gz from url
http://packages.octave.org/download/dataframe-1.0.0.tar.gz
error: called from:
error:   /usr/share/octave/3.8.1/m/pkg/pkg.m at line 390, column 13

While browsing the web I did not find any documentation (e.g., what the
command is to import a file, or the format of the objects), only a brief
description stating that it is analogous to R's data.frame. Can you point
me in the right direction?



Re: Import large field-delimited file with strings and numbers

Markus Bergholz
In reply to this post by Joao Rodrigues


On Mon, Sep 8, 2014 at 7:54 PM, João Rodrigues <[hidden email]> wrote:

> <snip>
> If someone wants to check this issue the data I am working with is public:
>
> http://www.bls.gov/cew/data/files/*/csv/*_annual_singlefile.zip
>
> where * = 1990:2013

404 in all combinations I've tried

> http://www.bls.gov/cew/datatoc.htm explains the content.

Re: Import large field-delimited file with strings and numbers

Markus Bergholz


On Mon, Sep 8, 2014 at 9:27 PM, Markus Bergholz <[hidden email]> wrote:


> On Mon, Sep 8, 2014 at 7:54 PM, João Rodrigues <[hidden email]> wrote:
>> <snip>
>> http://www.bls.gov/cew/data/files/*/csv/*_annual_singlefile.zip
>>
>> where * = 1990:2013
>
> 404 in all combinations I've tried

nvm, got it.
which columns do you need?

Re: Import large field-delimited file with strings and numbers

Joao Rodrigues
On 09/08/2014 08:27 PM, Markus Bergholz wrote:

> <snip>
>
> nvm, got it.
> which columns do you need?
Hi. As I said above, I already solved the problem (with your help).

I just put the link so that someone interested can check the memory overload problem.

(But the data I need to extract is in columns 1-3 and 8-11.)

Many thanks
j



Re: Import large field-delimited file with strings and numbers

Markus Bergholz


On Mon, Sep 8, 2014 at 9:44 PM, Joao Rodrigues <[hidden email]> wrote:
> <snip>
>
> (But the data I need to extract is in columns 1-3 and 8-11.)

octave:17> joao
all done, lua stack top 0
all done, lua stack top 0
all done, lua stack top 0
all done, lua stack top 0
all done, lua stack top 0
all done, lua stack top 0
all done, lua stack top 0
Elapsed time is 50.5821 seconds.
octave:18> memory

 Memory used by Octave:   277.438 MB
 Physical Memory (RAM): 7929.52 MB

octave:19> mean(mean(m))
ans =       9605668.464203


octave:21> m(1:10,:)
ans =

                     0                     1                     0                     0                  1196                 54587            4050257902
                     0                     1                     0                     0                  1196                 54587            4050257902
                     0                     1                     0                     0                   587                 11537             694789102
                     0                     1                     0                     0                     2                    13                437863
                     0                     1                     0                     0                    17                   154              11894235
                     0                     1                     0                     0                    46                  1760             144453410
                     0                     1                     0                     0                    32                  6245             406490963
                     0                     1                     0                     0                    26                   862              16926869
                     0                     1                     0                     0                     3                   566              49683552
                     0                     1                     0                     0                   484                 33451            2725581908



m = ones (3565137, 7);          % preallocate: rows in the 2013 file x columns wanted
index = [1 2 3 8 9 10 11];      % columns to pull from the csv
tic
for n = 1:length (index)
  % experimental lua helper reads one csv column at a time
  m(:,n) = lua ('colaccess', '2013.annual.singlefile.csv', index(n), 2)';
end
toc


and the lua script is not optimized... should be possible to pimp it < 30 seconds.
but it's experimental, haven't verified the result :P


 


Re: Import large field-delimited file with strings and numbers

Markus Bergholz


On Mon, Sep 8, 2014 at 10:14 PM, Markus Bergholz <[hidden email]> wrote:


> <snip>
>
> And the lua script is not optimized... it should be possible to get it
> under 30 seconds. But it's experimental, I haven't verified the result :P

Yeah, columns 1 and 3 are wrong, because they are strings in the csv
file... should be easy to manage.

 


 


Re: Import large field-delimited file with strings and numbers

CdeMills
In reply to this post by Joao Rodrigues
Joao Rodrigues wrote
I got this:

octave:1> pkg install -forge dataframe
error: could not download file dataframe-1.0.0.tar.gz from url
http://packages.octave.org/download/dataframe-1.0.0.tar.gz
error: called from:
error:   /usr/share/octave/3.8.1/m/pkg/pkg.m at line 390, column 13

While browsing the web I did not find any documentation (e.g., what the
command is to import a file, or the format of the objects), only a brief
description stating that it is analogous to R's data.frame. Can you point
me in the right direction?
Strange. I'll ask Carnë to look at this issue. He uploaded this version at the end of March. My guess is that you need to download the package manually and install it from the local file.
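
(In case it helps, the manual route would look something like this; the
tarball URL is the one from the error message and the local path is just
an example:)

% download http://packages.octave.org/download/dataframe-1.0.0.tar.gz
% with a browser or wget, then install and load it from the local file:
pkg install ~/Downloads/dataframe-1.0.0.tar.gz
pkg load dataframe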

For usage, the main information is contained in the main file itself. A simple call:

resu=dataframe('some_data.csv');

There are a few options to deal with issues like numbers encoded as '3,14' where a comma is used as the decimal separator, skipping some initial garbage (e.g. a free-format header), and so on.

Once the data are loaded, the object tries to mimic an ordinary array; the only constraint is that each column must be of a homogeneous type. You may have a global object whose first column is a string, the second a char, the third an integer, and so on.
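
(A small usage sketch based on the description above; the file name and the
exact behaviour are only indicative, the authoritative documentation is the
help text in the package's main file:)

pkg load dataframe
resu = dataframe ('some_data.csv');   % read the CSV into a dataframe object
resu(1:10, :)                         % range selection works much like a matrix
resu(:, 2) * 2                        % arithmetic on a (numeric) column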

regards

Pascal

Re: how does fscanf with "C" work?

Francesco Potortì
In reply to this post by Thomas D. Dean-2
>> I'll try to be more specific.  If I get no answer from this list, I'll
>> submit a bug report.
>>
>> I create a text file made of four lines, like this:
>>
>> ===File ~/math/workarea/test.txt============================
>> field1, field2, field3, field4
>> A,        a,        1,       1.0,
>> B,        b,        2,        2.0,
>> C,        c,        3,        3.0,
>> ============================================================
>>
>> Then this is what i get:
>>
>> octave> version
>> ans = 3.8.2
>> octave> fid=fopen("test.txt");
>> octave> fgetl(fid);
>> octave> [v1, v2, v3, v4, count, errmsg] = fscanf(fid, "%s%s%d,%f,", "C")
>> v1 = A,
>> v2 = a,
>> v3 =  1
>> v4 =  1
>> count =  4
>> errmsg =
>> octave> fclose(fid);
>
>I have not been following this, but, fscanf is doing exactly what you
>asked it to.

Yes, apparently it is so.  I'll propose a clarification to the manual
once I understand exactly what's going on.

>Read 'help fscanf' carefully.  You are using the second form.
>
>fd=fopen("xx.dat")
>fgetl(fd)
>[v1, v2, v3, v4, count, errmsg] = fscanf(fd, "%s%s%d,%f", "C")
>
>This is iterative and so, slow.

Yes.  However, I was assuming that scanf in the "C" form behaved like
the first form, that is, kept reading its input until end of file.  That
would have mirrored the printf function, which reuses its template when
given a matrix to print.  Moreover, the manual names the returned values
V1, V2, suggesting that they are vectors.  From my experiments and your
answer, it appears that my understanding was wrong.

>You can read the entire file using the first form of fscanf.  However,
>since the strings are not quoted, octave will translate the ascii
>characters to numerical byte values.
>A==65, B==66, etc.

This is new to me, and I cannot find it in the manual.  Again, once I
completely understand this, I'll propose a correction to the manual.

>fd=fopen("xx.dat")
>fgetl(fd)
>[A,count,msg]=fscanf(fd,"%s%s%d,%f%s")
>
>The contents of A show this conversion and also, the extra ',' on each
>line that must be accounted for in the format.

Alright, I'll make experiments and get back here.
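
For the record, here is a side-by-side sketch of the two call forms being
compared, on the test.txt file shown above (the results are as reported in
this thread, I have not re-verified them here):

% "C" form: one pass over the template, one output variable per conversion
fid = fopen ("test.txt");
fgetl (fid);                                     % skip the header line
[v1, v2, v3, v4, count] = fscanf (fid, "%s%s%d,%f,", "C");
fclose (fid);

% first form: keeps reading until EOF and returns a single numeric column
% vector, so %s fields come back as character codes (A == 65, B == 66, ...)
fid = fopen ("test.txt");
fgetl (fid);
[A, count, msg] = fscanf (fid, "%s%s%d,%f%s");
fclose (fid);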

>By the way, your first paragraph sounds threatening and makes many
>people not answer.

Oops, sorry, that was certainly not my intention :)

I meant that if people do not answer some question here, probably no one
knows, and maybe there is a bug...

--
Francesco Potortì (ricercatore)        Voice:  +39.050.621.3058
ISTI - Area della ricerca CNR          Mobile: +39.348.8283.107
via G. Moruzzi 1, I-56124 Pisa         Skype:  wnlabisti
(entrance 20, 1st floor, room C71)     Web:    http://fly.isti.cnr.it



Re: Import large field-delimited file with strings and numbers

Philip Nienhuis
In reply to this post by Philip Nienhuis
Philip Nienhuis wrote
Joao Rodrigues wrote
>> I need to import a large CSV file with multiple columns with mixed
>> string and number entries, such as:

<snip>
Why do you need to break it up using csv2cell? AFAICS that reads the entire file and directly translates the data into "values" in the output cell array, using very little temporary storage (the latter quite unlike textscan/strread).
It does read the entire file twice, once to assess the required dimensions for the cell array, the second (more intensive) pass for actually reading the data.

BTW, on my TODO list is an old (2+ years?) entry: adding a "headerlines" parameter to csv2cell..... (but OK my TODO list is looong)
FYI, I've just added this headerlines parameter to csv2cell; I needed it myself as well, so your post was the final nudge to get me started with it. I'll push it later this week.

Along the way I've also promoted the row counter type from int to long (int) - supposedly that'll make a difference with some extremely big data files :-)

This will be part of a (soon to come) io-2.2.4 release.

Philip

Re: Import large field-delimited file with strings and numbers

Philip Nienhuis
In reply to this post by Joao Rodrigues
Joao Rodrigues wrote
<snip>
The file I want to read has around 35 million rows, 15 columns and takes
200 MB of disk space: csv2cell simply ate up all memory and the computer
stopped responding.
<snip>
If someone wants to check this issue the data I am working with is public:

http://www.bls.gov/cew/data/files/*/csv/*_annual_singlefile.zip

where * = 1990:2013

http://www.bls.gov/cew/datatoc.htm explains the content.
I downloaded the 2013 file and gave it a try with csv2cell under a 64-bit Octave.
csv2cell() didn't even need the new headerlines parameter - it is a neat .csv file from top to bottom.
Results:


>> tic; data = csv2cell ('2013.annual.singlefile.csv'); toc
Elapsed time is 20.2152 seconds.
>> size (data)
ans =

   3565139        15

>> whos
Variables in the current scope:

   Attr Name         Size                     Bytes  Class
   ==== ====         ====                     =====  =====
        ans          1x2                         16  double
        data   3565139x15                 354645851  cell

Total is 53477087 elements using 354645867 bytes

>>

...and Octave's memory usage is ~4.6 GB (total occupied RAM on my Win7-64b box was 5.75 GB). So you'd need at least a 64-bit Octave + a 64-bit OS. For Windows an (experimental but IMO fairly good) 64-bit Octave is available these days.

Even after stripping away the rightmost columns, saving the result to a .mat file, restarting Octave and reading back the .mat file, Octave still needs > 4 GB to read the file. Once in the workspace the data occupies > 2 GB of RAM, while according to "whos" the cell array (3565139 x 4) occupies ~100 MB.
Puzzling numbers... as you say, Octave apparently needs a lot more RAM behind the scenes to hold such big cell arrays.
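
Something along these lines, in case anyone wants to repeat it (file names
are just examples):

pkg load io
data = csv2cell ('2013.annual.singlefile.csv');
data = data(:, 1:4);                 % strip away the rightmost columns
save ('-v7', 'subset.mat', 'data');
clear all                            % (or restart Octave)
load ('subset.mat')
whos ('data')                        % reports ~100 MB, yet actual RAM use runs into GBs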

Philip