Import large field-delimited file with strings and numbers

classic Classic list List threaded Threaded
31 messages Options
12
Reply | Threaded
Open this post in threaded view
|

Import large field-delimited file with strings and numbers

Joao Rodrigues

I need to import a large CSV file with multiple columns with mixed
string and number entries, such as:

field1, field2, field3, field4
A,        a,        1,       1.0,
B,        b,        2,        2.0,
C,        c,        3,        3.0,

and I want to pass this on to something like

cell1 ={[1,1] = A; [2,1] = B; [3,1] = C};
cell2 ={[1,1] = a; [2,1] = b; [3,1] = c};
arr3 =[1 2 3]';
arr4 =[1.0 2.0 3.0]';

furthermore, some columns can be ignored, the total number of entries is
known and there is a header.

How can I perform the import within reasonable time and little memory
overhead? Below are a few of my attempts.

Octave offers a wide range of functions to import files (csvread,
dlmread, textscan, textread, fscanf, fgetline) but as far as I can tell
none seems to get the job done.

csvread and dlmread don't work because they only handle numerical data.

textscan works eats up all the memory (the file is 200 MB, textscan's
memory usage was into the GB's). It doesn't allow to provide a priori
the size of the object.

fid = fopen(fstr,"r");
[tmp] = textscan(fid,'%s  %s %d %d','delimiter', ',', 'headerlines', 1);
fclose(fid);

fgetline allow to define the size of the object a priori but requires a
loop:

v = cell(nrow,4);
fid = fopen(fstr,"r");
tmp = fgetl(fid);
for irow = 1 : nrow
     tmp = fgetl(fid);
     v(irow,:) = strsplit(tmp,",");
endfor
fclose(fid);

Any suggestions? (I browsed google and the only suggestion I got was
using fgetl, but this is too slow. It takes 30sec to read 1% of the full
dataset).

Thanks







_______________________________________________
Help-octave mailing list
[hidden email]
https://lists.gnu.org/mailman/listinfo/help-octave
Reply | Threaded
Open this post in threaded view
|

Re: Import large field-delimited file with strings and numbers

Francesco Potortì
>I need to import a large CSV file with multiple columns with mixed
>string and number entries
...
>How can I perform the import within reasonable time and little memory
>overhead?

I suspect that fscanf is the fastest.  The main drawback is that it is
not stratightforward to use and requires some trial and error.

--
Francesco Potortì (ricercatore)        Voice:  +39.050.621.3058
ISTI - Area della ricerca CNR          Mobile: +39.348.8283.107
via G. Moruzzi 1, I-56124 Pisa         Skype:  wnlabisti
(entrance 20, 1st floor, room C71)     Web:    http://fly.isti.cnr.it

_______________________________________________
Help-octave mailing list
[hidden email]
https://lists.gnu.org/mailman/listinfo/help-octave
Reply | Threaded
Open this post in threaded view
|

Re: Import large field-delimited file with strings and numbers

Andreas Weber-4
In reply to this post by Joao Rodrigues
Am 06.09.2014 16:19, schrieb João Rodrigues:

>
> I need to import a large CSV file with multiple columns with mixed
> string and number entries, such as:
>
> field1, field2, field3, field4
> A,        a,        1,       1.0,
> B,        b,        2,        2.0,
> C,        c,        3,        3.0,
>
> and I want to pass this on to something like
>
> cell1 ={[1,1] = A; [2,1] = B; [3,1] = C};
> cell2 ={[1,1] = a; [2,1] = b; [3,1] = c};

Are the first 2 columns only one character long? Then you don't need to
use a cell array for this. A simple char matrix would be sufficient.
-- Andy

_______________________________________________
Help-octave mailing list
[hidden email]
https://lists.gnu.org/mailman/listinfo/help-octave
Reply | Threaded
Open this post in threaded view
|

Re: Import large field-delimited file with strings and numbers

Joao Rodrigues
On 09/06/2014 04:40 PM, Andreas Weber wrote:

> Am 06.09.2014 16:19, schrieb João Rodrigues:
>> I need to import a large CSV file with multiple columns with mixed
>> string and number entries, such as:
>>
>> field1, field2, field3, field4
>> A,        a,        1,       1.0,
>> B,        b,        2,        2.0,
>> C,        c,        3,        3.0,
>>
>> and I want to pass this on to something like
>>
>> cell1 ={[1,1] = A; [2,1] = B; [3,1] = C};
>> cell2 ={[1,1] = a; [2,1] = b; [3,1] = c};
> Are the first 2 columns only one character long? Then you don't need to
> use a cell array for this. A simple char matrix would be sufficient.
The strings generally have more than one character.

[VAL, COUNT, ERRMSG] = fscanf (FID, TEMPLATE, SIZE)

To answer Francesco's comment, fscanf allows for size specification
without loop, but I don't know how to write the template indicating that
commas (or something else) is a separator. Furthermore, even if the
separator is whitespace as output I got a concatenated string instead of
multiple fields...


_______________________________________________
Help-octave mailing list
[hidden email]
https://lists.gnu.org/mailman/listinfo/help-octave
Reply | Threaded
Open this post in threaded view
|

Re: Import large field-delimited file with strings and numbers

Thomas D. Dean-2
On 09/06/14 08:50, Joao Rodrigues wrote:

> To answer Francesco's comment, fscanf allows for size specification
> without loop, but I don't know how to write the template indicating that
> commas (or something else) is a separator. Furthermore, even if the
> separator is whitespace as output I got a concatenated string instead of
> multiple fields...

Try putting the comma in the format, like "%s,%d,%d"

Tom Dean

_______________________________________________
Help-octave mailing list
[hidden email]
https://lists.gnu.org/mailman/listinfo/help-octave
Reply | Threaded
Open this post in threaded view
|

Re: Import large field-delimited file with strings and numbers

bpabbott
Administrator
In reply to this post by Joao Rodrigues

On Sep 6, 2014, at 10:19 AM, João Rodrigues <[hidden email]> wrote:

>
> I need to import a large CSV file with multiple columns with mixed string and number entries, such as:
>
> field1, field2, field3, field4
> A,        a,        1,       1.0,
> B,        b,        2,        2.0,
> C,        c,        3,        3.0,
>
> and I want to pass this on to something like
>
> cell1 ={[1,1] = A; [2,1] = B; [3,1] = C};
> cell2 ={[1,1] = a; [2,1] = b; [3,1] = c};
> arr3 =[1 2 3]';
> arr4 =[1.0 2.0 3.0]';
>
> furthermore, some columns can be ignored, the total number of entries is known and there is a header.
>
> How can I perform the import within reasonable time and little memory overhead? Below are a few of my attempts.
>
> Octave offers a wide range of functions to import files (csvread, dlmread, textscan, textread, fscanf, fgetline) but as far as I can tell none seems to get the job done.
>
> csvread and dlmread don't work because they only handle numerical data.
>
> textscan works eats up all the memory (the file is 200 MB, textscan's memory usage was into the GB's). It doesn't allow to provide a priori the size of the object.
>
> fid = fopen(fstr,"r");
> [tmp] = textscan(fid,'%s  %s %d %d','delimiter', ',', 'headerlines', 1);
> fclose(fid);
>
> fgetline allow to define the size of the object a priori but requires a loop:
>
> v = cell(nrow,4);
> fid = fopen(fstr,"r");
> tmp = fgetl(fid);
> for irow = 1 : nrow
>    tmp = fgetl(fid);
>    v(irow,:) = strsplit(tmp,",");
> endfor
> fclose(fid);
>
> Any suggestions? (I browsed google and the only suggestion I got was using fgetl, but this is too slow. It takes 30sec to read 1% of the full dataset).
>
> Thanks

Assuming your file is always as simple as your example ... you may get improved speed by just avoiding loops.

I copied your example to a file foo.txt and modified it to avoid comma's at the end of each line.

field1, field2, field3, field4
A,        a,        1,       1.0
B,        b,        2,        2.0
C,        c,        3,        3.0

The code below avoids loops.

str = strsplit ((fileread ('foo.txt')), '\n');
str(end) = []; % strip empty string due to newline at EOF
table = cellfun (@(str) strsplit (str, {' ',','}), str, 'uniformoutput', false);
table = vertcat (table(2:end){:})
cell1 = table(:,1);
cell2 = table(:,2);
arr3 = str2num (cell2mat (table(:,3)));
arr4 = str2num (cell2mat (table(:,4)));

Ben


_______________________________________________
Help-octave mailing list
[hidden email]
https://lists.gnu.org/mailman/listinfo/help-octave
Reply | Threaded
Open this post in threaded view
|

Re: Import large field-delimited file with strings and numbers

Philip Nienhuis
In reply to this post by Joao Rodrigues
Joao Rodrigues wrote
I need to import a large CSV file with multiple columns with mixed
string and number entries, such as:

field1, field2, field3, field4
A,        a,        1,       1.0,
B,        b,        2,        2.0,
C,        c,        3,        3.0,

and I want to pass this on to something like

cell1 ={[1,1] = A; [2,1] = B; [3,1] = C};
cell2 ={[1,1] = a; [2,1] = b; [3,1] = c};
arr3 =[1 2 3]';
arr4 =[1.0 2.0 3.0]';

furthermore, some columns can be ignored, the total number of entries is
known and there is a header.
If you can get rid of the header and if the number of columns on each line is constant, csv2cell() in the io package is by far the fastest. It can read mixed numerical/text data.

Another useful trick that I sometimes use myself would be to read the file with textscan but then in chunks. You can specify the number of lines to read. textscan should remember the file position (see "help textscan").
After having read chunk# N, you can simply restart textscan (w/o headerlines param!) to read chunk# N+1, and repeat until EOF.
If you start with a small chunk you can check if the format string works.
Later on it is easy to append the data columns together (i.e., vertically concatenate the output cell matrices of textscan).

As for file size: I more or less regularly use textscan to read 30-50 MB csv files with 32-bit Octave in one swoop. Takes a while, true, but it works.

Philip
Reply | Threaded
Open this post in threaded view
|

Re: Import large field-delimited file with strings and numbers

Joao Rodrigues

>> I need to import a large CSV file with multiple columns with mixed
>> string and number entries, such as:
>>
>> field1, field2, field3, field4
>> A,        a,        1,       1.0,
>> B,        b,        2,        2.0,
>> C,        c,        3,        3.0,
>>
>> and I want to pass this on to something like
>>
>> cell1 ={[1,1] = A; [2,1] = B; [3,1] = C};
>> cell2 ={[1,1] = a; [2,1] = b; [3,1] = c};
>> arr3 =[1 2 3]';
>> arr4 =[1.0 2.0 3.0]';
>>
>> furthermore, some columns can be ignored, the total number of entries is
>> known and there is a header.

In response to Thomas Dean, Ben Abbot and Philip Nienhuis: Thanks for
the tips. (I did not know either the fileread or csv2cell functions).

Nevertheless, the problem persists. Either with fscanf, cellfun +
strsplit or csv2cell (indeed the fastest) the memory requirements blow
up (4 GB RAM + a few more GB of swap) when I want to import the big csv
(~3.5 million lines, 15 columns, 200 MB).

On the contrary, if I use a loop with strsplit it takes forever (~ 30
min) but memory use is just a few hundred MB.

> Another useful trick that I sometimes use myself would be to read the file
> with textscan but then in chunks. You can specify the number of lines to
> read. textscan should remember the file position (see "help textscan").
> After having read chunk# N, you can simply restart textscan (w/o headerlines
> param!) to read chunk# N+1, and repeat until EOF.
I guess this is the right direction. I had never noticed the N option in
textscan. Using textscan to read chunk by chunk is much faster (~ 8 min).

Yet, csv2cell is orders of magnitude faster. I will break the big file
into chunks (using fileread, strfind to determine newlines and fprintf)
and then apply csv2cell chunk-wise.

Thank you all
Joao






_______________________________________________
Help-octave mailing list
[hidden email]
https://lists.gnu.org/mailman/listinfo/help-octave
Reply | Threaded
Open this post in threaded view
|

Re: Import large field-delimited file with strings and numbers

Francesco Potortì
In reply to this post by Joao Rodrigues
>To answer Francesco's comment, fscanf allows for size specification
>without loop, but I don't know how to write the template indicating that
>commas (or something else) is a separator. Furthermore, even if the
>separator is whitespace as output I got a concatenated string instead of
>multiple fields...

Hm.  I tried myself but can't get fscanf to read multiple lines...

Here is what I get:


===File ~/math/workarea/test.txt============================
field1, field2, field3, field4
A,        a,        1,       1.0,
B,        b,        2,        2.0,
C,        c,        3,        3.0,
============================================================

octave> fid=fopen("test.txt");
octave> fgetl(fid); [v1, v2, v3, v4, count, errmsg]=fscanf(fid, "%s%s%d,%f,", "C")
v1 = A,
v2 = a,
v3 =  1
v4 =  1
count =  4
errmsg =
octave> fclose(fid);

So everything is nice and good, but fscanf stops at the first row of
data...

Oh, and I also tried twith a template "%s%s%d,%f,\n", explicitely
adding a newline at the end, because the Octave manual does not say what
a "whitespace" is, but no change :(

--
Francesco Potortì (ricercatore)        Voice:  +39.050.621.3058
ISTI - Area della ricerca CNR          Mobile: +39.348.8283.107
via G. Moruzzi 1, I-56124 Pisa         Skype:  wnlabisti
(entrance 20, 1st floor, room C71)     Web:    http://fly.isti.cnr.it


_______________________________________________
Help-octave mailing list
[hidden email]
https://lists.gnu.org/mailman/listinfo/help-octave
Reply | Threaded
Open this post in threaded view
|

Re: Import large field-delimited file with strings and numbers

bpabbott
Administrator

On Sep 7, 2014, at 3:17 PM, Francesco Potortì <[hidden email]> wrote:

>> To answer Francesco's comment, fscanf allows for size specification
>> without loop, but I don't know how to write the template indicating that
>> commas (or something else) is a separator. Furthermore, even if the
>> separator is whitespace as output I got a concatenated string instead of
>> multiple fields...
>
> Hm.  I tried myself but can't get fscanf to read multiple lines...
>
> Here is what I get:
>
>
> ===File ~/math/workarea/test.txt============================
> field1, field2, field3, field4
> A,        a,        1,       1.0,
> B,        b,        2,        2.0,
> C,        c,        3,        3.0,
> ============================================================
>
> octave> fid=fopen("test.txt");
> octave> fgetl(fid); [v1, v2, v3, v4, count, errmsg]=fscanf(fid, "%s%s%d,%f,", "C")
> v1 = A,
> v2 = a,
> v3 =  1
> v4 =  1
> count =  4
> errmsg =
> octave> fclose(fid);
>
> So everything is nice and good, but fscanf stops at the first row of
> data...
>
> Oh, and I also tried twith a template "%s%s%d,%f,\n", explicitely
> adding a newline at the end, because the Octave manual does not say what
> a "whitespace" is, but no change :(

The doc-string indicates that a 2nd input is needed for fgetl() to read more than one line.

 -- Built-in Function: STR = fgetl (FID)
 -- Built-in Function: STR = fgetl (FID, LEN)
     Read characters from a file, stopping after a newline, or EOF, or
     LEN characters have been read.  The characters read, excluding the
     possible trailing newline, are returned as a string.

     If LEN is omitted, 'fgetl' reads until the next newline character.

     If there are no more characters to read, 'fgetl' returns -1.

     To read a line and return the terminating newline see 'fgets'.

     See also: fgets, fscanf, fread, fopen.

Ben


_______________________________________________
Help-octave mailing list
[hidden email]
https://lists.gnu.org/mailman/listinfo/help-octave
Reply | Threaded
Open this post in threaded view
|

Re: Import large field-delimited file with strings and numbers

Petri Piila
In reply to this post by Francesco Potortì
I succeed in extracting the last two numeric columns by applying %*s to the first two string columns:

octave:76> f = fopen("test.txt","r")
f =  4
octave:77> fgetl(f)
ans = field1, field2, field3, field4
octave:78> c34 = fscanf(f,"%*s%*s%d,%f,",[2,Inf]);
octave:79> c34'
ans =

   1   1
   2   2
   3   3

octave:80> fclose(f);

However, I’m not able to extract the strings accordingly.

With the “C” approach you used, I think you have to loop.


On 07 Sep 2014, at 22:17, Francesco Potortì <[hidden email]> wrote:

To answer Francesco's comment, fscanf allows for size specification
without loop, but I don't know how to write the template indicating that
commas (or something else) is a separator. Furthermore, even if the
separator is whitespace as output I got a concatenated string instead of
multiple fields...

Hm.  I tried myself but can't get fscanf to read multiple lines...

Here is what I get:


===File ~/math/workarea/test.txt============================
field1, field2, field3, field4
A,        a,        1,       1.0,
B,        b,        2,        2.0,
C,        c,        3,        3.0,
============================================================

octave> fid=fopen("test.txt");
octave> fgetl(fid); [v1, v2, v3, v4, count, errmsg]=fscanf(fid, "%s%s%d,%f,", "C")
v1 = A,
v2 = a,
v3 =  1
v4 =  1
count =  4
errmsg =
octave> fclose(fid);

So everything is nice and good, but fscanf stops at the first row of
data...

Oh, and I also tried twith a template "%s%s%d,%f,\n", explicitely
adding a newline at the end, because the Octave manual does not say what
a "whitespace" is, but no change :(

--
Francesco Potortì (ricercatore)        Voice:  +39.050.621.3058
ISTI - Area della ricerca CNR          Mobile: +39.348.8283.107
via G. Moruzzi 1, I-56124 Pisa         Skype:  wnlabisti
(entrance 20, 1st floor, room C71)     Web:    http://fly.isti.cnr.it


_______________________________________________
Help-octave mailing list
[hidden email]
https://lists.gnu.org/mailman/listinfo/help-octave


_______________________________________________
Help-octave mailing list
[hidden email]
https://lists.gnu.org/mailman/listinfo/help-octave
Reply | Threaded
Open this post in threaded view
|

Re: Import large field-delimited file with strings and numbers

Francesco Potortì
In reply to this post by bpabbott
Joao Rodrigues:
>>> To answer Francesco's comment, fscanf allows for size specification
>>> without loop, but I don't know how to write the template indicating that
>>> commas (or something else) is a separator. Furthermore, even if the
>>> separator is whitespace as output I got a concatenated string instead of
>>> multiple fields...

Francesco Potortì:

>> Hm.  I tried myself but can't get fscanf to read multiple lines...
>>
>> Here is what I get:
>>
>>
>> ===File ~/math/workarea/test.txt============================
>> field1, field2, field3, field4
>> A,        a,        1,       1.0,
>> B,        b,        2,        2.0,
>> C,        c,        3,        3.0,
>> ============================================================
>>
>> octave> fid=fopen("test.txt");
>> octave> fgetl(fid);
>> octave> [v1, v2, v3, v4, count, errmsg]=fscanf(fid, "%s%s%d,%f,", "C")
>> v1 = A,
>> v2 = a,
>> v3 =  1
>> v4 =  1
>> count =  4
>> errmsg =
>> octave> fclose(fid);
>>
>> So everything is nice and good, but fscanf stops at the first row of
>> data...
>>
>> Oh, and I also tried twith a template "%s%s%d,%f,\n", explicitely
>> adding a newline at the end, because the Octave manual does not say what
>> a "whitespace" is, but no change :(

Ben Abbott:
>The doc-string indicates that a 2nd input is needed for fgetl() to read more than one line.

Hm, maybe the way I wrote it was cconfusing.  I now edited the above
cited text by adding a newline and a prompt in my code for increased
clarity.

In the code, fgetl is just used to get rid of the header line.  My
problem is with fscanf, which in my plan should read all the subsequent
lines, but only reads the first one after the header line.

--
Francesco Potortì (ricercatore)        Voice:  +39.050.621.3058
ISTI - Area della ricerca CNR          Mobile: +39.348.8283.107
via G. Moruzzi 1, I-56124 Pisa         Skype:  wnlabisti
(entrance 20, 1st floor, room C71)     Web:    http://fly.isti.cnr.it


_______________________________________________
Help-octave mailing list
[hidden email]
https://lists.gnu.org/mailman/listinfo/help-octave
Reply | Threaded
Open this post in threaded view
|

Re: Import large field-delimited file with strings and numbers

Francesco Potortì
In reply to this post by Petri Piila
I rearranged to follow this list's convention of bottom posting.  Please
answer at the end.

>On 07 Sep 2014, at 22:17, Francesco Potortì <[hidden email]> wrote:
>
>>> To answer Francesco's comment, fscanf allows for size specification
>>> without loop, but I don't know how to write the template indicating that
>>> commas (or something else) is a separator. Furthermore, even if the
>>> separator is whitespace as output I got a concatenated string instead of
>>> multiple fields...
>>
>> Hm.  I tried myself but can't get fscanf to read multiple lines...
>>
>> Here is what I get:
>>
>>
>> ===File ~/math/workarea/test.txt============================
>> field1, field2, field3, field4
>> A,        a,        1,       1.0,
>> B,        b,        2,        2.0,
>> C,        c,        3,        3.0,
>> ============================================================
>>
>> octave> fid=fopen("test.txt");
>> octave> fgetl(fid); [v1, v2, v3, v4, count, errmsg]=fscanf(fid, "%s%s%d,%f,", "C")
>> v1 = A,
>> v2 = a,
>> v3 =  1
>> v4 =  1
>> count =  4
>> errmsg =
>> octave> fclose(fid);
>>
>> So everything is nice and good, but fscanf stops at the first row of
>> data...
>>
>> Oh, and I also tried twith a template "%s%s%d,%f,\n", explicitely
>> adding a newline at the end, because the Octave manual does not say what
>> a "whitespace" is, but no change :(

Petri Piila;

>I succeed in extracting the last two numeric columns by applying %*s to the first two string columns:
>
>octave:76> f = fopen("test.txt","r")
>f =  4
>octave:77> fgetl(f)
>ans = field1, field2, field3, field4
>octave:78> c34 = fscanf(f,"%*s%*s%d,%f,",[2,Inf]);
>octave:79> c34'
>ans =
>
>   1   1
>   2   2
>   3   3
>
>octave:80> fclose(f);
>
>However, I’m not able to extract the strings accordingly.

You need the "C" approach to to that.

>With the “C” approach you used, I think you have to loop.

While the docs are not clear on this, it seems not.  Since fscanf
returns vectors, I suppose it is intended to read until end of file.
But I cannot seem to have it work...

--
Francesco Potortì (ricercatore)        Voice:  +39.050.621.3058
ISTI - Area della ricerca CNR          Mobile: +39.348.8283.107
via G. Moruzzi 1, I-56124 Pisa         Skype:  wnlabisti
(entrance 20, 1st floor, room C71)     Web:    http://fly.isti.cnr.it

_______________________________________________
Help-octave mailing list
[hidden email]
https://lists.gnu.org/mailman/listinfo/help-octave
Reply | Threaded
Open this post in threaded view
|

Re: Import large field-delimited file with strings and numbers

bpabbott
Administrator
In reply to this post by Francesco Potortì
On Sep 7, 2014, at 4:52 PM, Francesco Potortì <[hidden email]> wrote:

> Ben Abbott:
>> The doc-string indicates that a 2nd input is needed for fgetl() to read more than one line.
>
> Hm, maybe the way I wrote it was cconfusing.  I now edited the above
> cited text by adding a newline and a prompt in my code for increased
> clarity.

Opps ... you're correct. I misread your code.

Ben


_______________________________________________
Help-octave mailing list
[hidden email]
https://lists.gnu.org/mailman/listinfo/help-octave
Reply | Threaded
Open this post in threaded view
|

Re: Import large field-delimited file with strings and numbers

Philip Nienhuis
In reply to this post by Francesco Potortì
Francesco Potortì wrote
Joao Rodrigues:
>>> To answer Francesco's comment, fscanf allows for size specification
>>> without loop, but I don't know how to write the template indicating that
>>> commas (or something else) is a separator. Furthermore, even if the
>>> separator is whitespace as output I got a concatenated string instead of
>>> multiple fields...

Francesco Potortì:
>> Hm.  I tried myself but can't get fscanf to read multiple lines...
>>
>> Here is what I get:
>>
>>
>> ===File ~/math/workarea/test.txt============================
>> field1, field2, field3, field4
>> A,        a,        1,       1.0,
>> B,        b,        2,        2.0,
>> C,        c,        3,        3.0,
>> ============================================================
>>
>> octave> fid=fopen("test.txt");
>> octave> fgetl(fid);
>> octave> [v1, v2, v3, v4, count, errmsg]=fscanf(fid, "%s%s%d,%f,", "C")
>> v1 = A,
>> v2 = a,
>> v3 =  1
>> v4 =  1
>> count =  4
>> errmsg =
>> octave> fclose(fid);
>>
>> So everything is nice and good, but fscanf stops at the first row of
>> data...
>>
>> Oh, and I also tried twith a template "%s%s%d,%f,\n", explicitely
>> adding a newline at the end, because the Octave manual does not say what
>> a "whitespace" is, but no change :(

Ben Abbott:
>The doc-string indicates that a 2nd input is needed for fgetl() to read more than one line.

Hm, maybe the way I wrote it was cconfusing.  I now edited the above
cited text by adding a newline and a prompt in my code for increased
clarity.

In the code, fgetl is just used to get rid of the header line.  My
problem is with fscanf, which in my plan should read all the subsequent
lines, but only reads the first one after the header line.
Hopefully I properly understand what you want to achieve, but to read the rest of an entire file starting at some position, I usually do:

<....fseek or previous read operations....>
str = fread (fid, "char=>char").';  ## Note transpose operator

Philip
Reply | Threaded
Open this post in threaded view
|

how does fscanf with "C" work?

Francesco Potortì
In reply to this post by Francesco Potortì
I wrote:

>In the code, fgetl is just used to get rid of the header line.  My
>problem is with fscanf, which in my plan should read all the subsequent
>lines, but only reads the first one after the header line.

I'll try to be more specific.  If I get no answer from this list, I'll
submit a bug report.

I create a text file made of four lines, like this:

===File ~/math/workarea/test.txt============================
field1, field2, field3, field4
A,        a,        1,       1.0,
B,        b,        2,        2.0,
C,        c,        3,        3.0,
============================================================

Then this is what i get:

octave> version
ans = 3.8.2
octave> fid=fopen("test.txt");
octave> fgetl(fid);
octave> [v1, v2, v3, v4, count, errmsg] = fscanf(fid, "%s%s%d,%f,", "C")
v1 = A,
v2 = a,
v3 =  1
v4 =  1
count =  4
errmsg =
octave> fclose(fid);

The result is that fscanf reads a single line from the file
(specifically, the second line).  What I expected was fscanf to read all
the subsequent lines (second, third and fourth) and put the results in
the four vectors given as output.

The same happens if I use a character conversion rather than a string
conversion:

octave> fid=fopen("test.txt");
octave> fgetl(fid);
octave> [v1, v2, v3, v4, count, errmsg] = fscanf(fid, "%c, %c,%d,%f,\n", "C")
v1 = A
v2 = a
v3 =  1
v4 =  1
count =  4
errmsg =
octave> fclose(fid);

So the question is: is fscanf with the "C" calling convention intended
and able to read multi-line input?

Whatever the answer, a clarification of the manual is needed.  I can
contribute to clarifying the manual, once I understand how fscanf with
"C" is meant to behave.

--
Francesco Potortì (ricercatore)        Voice:  +39.050.621.3058
ISTI - Area della ricerca CNR          Mobile: +39.348.8283.107
via G. Moruzzi 1, I-56124 Pisa         Skype:  wnlabisti
(entrance 20, 1st floor, room C71)     Web:    http://fly.isti.cnr.it

_______________________________________________
Help-octave mailing list
[hidden email]
https://lists.gnu.org/mailman/listinfo/help-octave
Reply | Threaded
Open this post in threaded view
|

Re: Import large field-delimited file with strings and numbers

Helios de Rosario
In reply to this post by Joao Rodrigues
>>> João Rodrigues<[hidden email]> 06/09/2014 16:19 >>>

I need to import a large CSV file with multiple columns with mixed
string and number entries, such as:

field1, field2, field3, field4
A,        a,        1,       1.0,
B,        b,        2,        2.0,
C,        c,        3,        3.0,

and I want to pass this on to something like

cell1 ={[1,1] = A; [2,1] = B; [3,1] = C};
cell2 ={[1,1] = a; [2,1] = b; [3,1] = c};
arr3 =[1 2 3]';
arr4 =[1.0 2.0 3.0]';

furthermore, some columns can be ignored, the total number of entries is
known and there is a header.

How can I perform the import within reasonable time and little memory
overhead? Below are a few of my attempts.
>>
 
As others have mentioned, the problem is easy if in your table all columns have the same number of characters. In that case, you can read the whole text into a char matrix, and then use mat2cell to separate the strings by the width of the columns, or str2num to put the numeric values in a matrix of doubles.
 
If the file is too large, you cannot avoid looping, but you can work with "not-so-large" text chunks, e.g.:
 
cell1 = cell2 = cell(nrow, 1);
arr1 = arr2 = nan(nrow, 1);
strail = []; # Variable to store trailing characters in the loop
lastrow = 0;
fid = fopen("test.txt", "r");
while(!feof(fid))
  s = [strail, fscanf(fid, "%c", 1024^2)]; # Chunks of 1 MB
  # Use only "full" lines, and save trailing characters for later
  lastnl = find(s=="\n", 1, "last");
  strail = s(lastnl+1:end);
  s = s(1:lastnl);
  # Here manipulate the string "s"
  # write the number of written lines in variable "n"
  # and fill in cell1(lastrow + (1:n)), etc.
  lastrow += n;
endwhile
 
If your file does not have columns of the same width, you might use other programs to import it and export it with fixed-width columns, but you can also transform the original ascii string within Octave, and avoid writing new files into disk. Let's say that the maximum width of your columns is 5 characters:
 
maxlength = 5;

# Find separation characters (commas and new line)
sep = strchr(s, ",\n");
nval = length(sep);
# Actual length of your "words"
sl = [sep(1), diff(sep)];
# String with fixed width slots
sfull = ' '(ones(1,nval*(maxlength+1)));
# Positions of separation characters in "sfull"
sepfull = (maxlength + 1) * (1:nval);
# Fill in sfull backwards
for c = 0:maxlength
  sfull(sepfull - c) = ...
    ifelse(c < sl, s(max(1, sep - c)), ' ');
endfor
# Reshape sfull
sfull = reshape(sfull', (maxlength+1)*ncols)';
sfull(:,end)=[];
 
In Windows, you might want to use strrep to replace EOL characters ("\n\r") to UNIX-like "\n" as a first step.
 
Combining these two tricks, I have read a sample file of 44 MB in 17 seconds (reading chunks of 1MB), 50 times faster than looping over fscanf or fgetl + strsplit.
 
Helios De Rosario

IBV

SABEMOS QUÉ HICIMOS EN 2013
365 días de Innovación
ANUARIO DE BIOMECÁNICA
http://www.ibv.org/anuario2013
______________________________
INSTITUTO DE BIOMECÁNICA DE VALENCIA
Universidad Politécnica de Valencia • Edificio 9C
Camino de Vera s/n • 46022 VALENCIA (ESPAÑA)
Tel. +34 96 387 91 60 • Fax +34 96 387 91 69
www.ibv.org

Antes de imprimir este e-mail piense bien si es necesario hacerlo.
En cumplimiento de la Ley Orgánica 15/1999 reguladora de la Protección de Datos de Carácter Personal, le informamos de que el presente mensaje contiene información confidencial, siendo para uso exclusivo del destinatario arriba indicado. En caso de no ser usted el destinatario del mismo le informamos que su recepción no le autoriza a su divulgación o reproducción por cualquier medio, debiendo destruirlo de inmediato, rogándole lo notifique al remitente.

_______________________________________________
Help-octave mailing list
[hidden email]
https://lists.gnu.org/mailman/listinfo/help-octave
Reply | Threaded
Open this post in threaded view
|

Re: Import large field-delimited file with strings and numbers

Philip Nienhuis
In reply to this post by Joao Rodrigues
Joao Rodrigues wrote
>> I need to import a large CSV file with multiple columns with mixed
>> string and number entries, such as:
>>
>> field1, field2, field3, field4
>> A,        a,        1,       1.0,
>> B,        b,        2,        2.0,
>> C,        c,        3,        3.0,
>>
>> and I want to pass this on to something like
>>
>> cell1 ={[1,1] = A; [2,1] = B; [3,1] = C};
>> cell2 ={[1,1] = a; [2,1] = b; [3,1] = c};
>> arr3 =[1 2 3]';
>> arr4 =[1.0 2.0 3.0]';
>>
>> furthermore, some columns can be ignored, the total number of entries is
>> known and there is a header.

<snip>

Yet, csv2cell is orders of magnitude faster. I will break the big file
into chunks (using fileread, strfind to determine newlines and fprintf)
and then apply csv2cell chunk-wise.
Why do you need to break it up using csv2cell? AFAICS that reads the entire file and directly translates the data into "values" in the output cell array, using very little temporary storage (the latter quite unlike textscan/strread).
It does read the entire file twice, once to assess the required dimensions for the cell array, the second (more intensive) pass for actually reading the data.

BTW, on my TODO list is an old (2+ years?) entry: adding a "headerlines" parameter to csv2cell..... (but OK my TODO list is looong)

Philip
Reply | Threaded
Open this post in threaded view
|

Re: how does fscanf with "C" work?

Thomas D. Dean-2
In reply to this post by Francesco Potortì
On 09/08/14 02:56, Francesco Potortì wrote:

> I'll try to be more specific.  If I get no answer from this list, I'll
> submit a bug report.
>
> I create a text file made of four lines, like this:
>
> ===File ~/math/workarea/test.txt============================
> field1, field2, field3, field4
> A,        a,        1,       1.0,
> B,        b,        2,        2.0,
> C,        c,        3,        3.0,
> ============================================================
>
> Then this is what i get:
>
> octave> version
> ans = 3.8.2
> octave> fid=fopen("test.txt");
> octave> fgetl(fid);
> octave> [v1, v2, v3, v4, count, errmsg] = fscanf(fid, "%s%s%d,%f,", "C")
> v1 = A,
> v2 = a,
> v3 =  1
> v4 =  1
> count =  4
> errmsg =
> octave> fclose(fid);

I have not been following this, but, fscanf is doing exactly what you
asked it to.

Read 'help fscanf' carefully.  You are using the second form.

fd=fopen("xx.dat")
fgetl(fd)
[v1, v2, v3, v4, count, errmsg] = fscanf(fd, "%s%s%d,%f", "C")

This is iterative and so, slow.

You can read the entire file using the first form of fscanf.  However,
since the strings are not quoted, octave will translate the ascii
characters to numerical byte values.
A==65, B==66, etc.

fd=fopen("xx.dat")
fgetl(fd)
[A,count,msg]=fscanf(fd,"%s%s%d,%f%s")

The contents of A show this conversion and also, the extra ',' on each
line that must be accounted for in the format.

If you only need the numerical values, you can use csvread()
If you do that, you will notice there are 5 columns, not 4.  And, the
strings are converted to zeros.

A= csvread("xx.dat")

By the way, your first paragraph sounds threatening and makes many
people not answer.  And, a bug report on this will most likely be ignored.

Tom Dean

_______________________________________________
Help-octave mailing list
[hidden email]
https://lists.gnu.org/mailman/listinfo/help-octave
Reply | Threaded
Open this post in threaded view
|

Re: Import large field-delimited file with strings and numbers

CdeMills
In reply to this post by Joao Rodrigues
Joao Rodrigues wrote
I need to import a large CSV file with multiple columns with mixed
string and number entries, such as:

field1, field2, field3, field4
A,        a,        1,       1.0,
B,        b,        2,        2.0,
C,        c,        3,        3.0,

and I want to pass this on to something like

cell1 ={[1,1] = A; [2,1] = B; [3,1] = C};
cell2 ={[1,1] = a; [2,1] = b; [3,1] = c};
arr3 =[1 2 3]';
arr4 =[1.0 2.0 3.0]';

furthermore, some columns can be ignored, the total number of entries is
known and there is a header.

How can I perform the import within reasonable time and little memory
overhead? Below are a few of my attempts.

Octave offers a wide range of functions to import files (csvread,
dlmread, textscan, textread, fscanf, fgetline) but as far as I can tell
none seems to get the job done.

csvread and dlmread don't work because they only handle numerical data.

textscan works eats up all the memory (the file is 200 MB, textscan's
memory usage was into the GB's). It doesn't allow to provide a priori
the size of the object.
Hello Joaõ,
this is the perfect job for the "dataframe" package. It reads a CSV file and returns an object which try to mimics a matrix. You can select ranges, perform arithmetic operations, ...

You'll find in on the octave-forge source.

Regards

Pascal
12