Textscan Stops for Missing Data


gciriani
I'm reading a large tab-delimited file with a mix of text and numbers. textscan seems to do a better job than other solutions, especially because it supports the %q format and some of my text data are enclosed in quotation marks. However, textscan stops where data are missing. MATLAB seems to be able to keep going, but Octave stops. I tried a few of the options to no avail, and I probably do not understand how they work.
fID = fopen ("C:/Users/Giovanni/Documents/exmod.tab.txt");
C = textscan (fID, "%q %q %f %q %f %q %f %f" ...
      ,"HeaderLines", 1 ...
      ,"EmptyValue", NaN ...
      ,"Delimiter", "\t" ...
      ,"ReturnOnError", true ... % default value, not needed
      );
fclose (fID);
Any suggestion is appreciated.
Giovanni Ciriani - Windows 10, Octave 4.2.1, configured for x86_64-w64-mingw32

Re: Textscan Stops for Missing Data

NJank


On Aug 24, 2017 2:57 AM, "gciriani" <[hidden email]> wrote:
I'm reading a large tab-delimited file with a mix of text and numbers. It
seems that Textscan does a better job than other solutions, especially
because of the format %q and some text data are enclosed in quotation marks.
However, where data are missing Textscan stops. Matlab seems to be able to
keep going, but Octave stops.

Can you provide a few lines of sample data that cause the problem?

_______________________________________________
Help-octave mailing list
[hidden email]
https://lists.gnu.org/mailman/listinfo/help-octave

Re: Textscan Stops for Missing Data

gciriani
I have attached a sample, exlong.tab, of about 110 lines (after the header). The code I outlined correctly reads the first 30 lines; at the 31st line it reads the data up to the column with the missing datum (column viewtime) and then stops.

The MATLAB reference shows, under the section "Specify Delimiter and Empty Value Conversion", that it should be possible to keep going.

P.S.
For the example I just attached, the format string needs to be the following:
"%q %q %f %q %f %q %f %f %q %f %q %f %q %s %f %q"
Giovanni Ciriani - Windows 10, Octave 4.2.1, configured for x86_64-w64-mingw32

Re: Textscan Stops for Missing Data

siko1056
I think in your case `textscan` has a problem with the Tab delimiter:

<code>
str = "\"1176-0\"\t5\t\"d\"\t4\n";
C = textscan (str, "%q %f %q %f","EmptyValue", NaN,"Delimiter", "\t", ...
  "ReturnOnError", true)
str = "\"1176-0\"\t\t\"d\"\t4\n";
C = textscan (str, "%q %f %q %f","EmptyValue", NaN,"Delimiter", "\t", ...
  "ReturnOnError", true)

## Replace delimiter with comma

str = "\"1176-0\",,\"d\",4\n";
C = textscan (str, "%q %f %q %f","EmptyValue", NaN,"Delimiter", ",", ...
  "ReturnOnError", true)
</code>

<output>
C =
{
  [1,1] =
  {
    [1,1] = 1176-0
  }

  [1,2] =  5
  [1,3] =
  {
    [1,1] = d
  }

  [1,4] =  4
}

C =
{
  [1,1] =
  {
    [1,1] = 1176-0
  }

  [1,2] = [](0x1)
  [1,3] = {}(0x1)
  [1,4] = [](0x1)
}

## Replace delimiter with comma

C =
{
  [1,1] =
  {
    [1,1] = 1176-0
  }

  [1,2] = NaN
  [1,3] =
  {
    [1,1] = d
  }

  [1,4] =  4
}
</output>

It may be worth filing a bug report requesting a delimiter test for tabs like the one in [1].  As a temporary fix, you can use "Find & Replace" in the Octave editor to convert your file to a comma-separated values (CSV) file by replacing tabs with commas.
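
The same substitution can also be done in Octave itself rather than in the editor. A minimal sketch, reusing the test string from above; it assumes that no field contains a comma of its own (otherwise the comma would be taken as a delimiter), and for a real file the text could first be loaded with fileread:

```octave
% Sketch only: replace every tab with a comma, then parse the string.
% Assumption: no data field contains a literal comma.
str = "\"1176-0\"\t\t\"d\"\t4\n";      % tab-delimited row with an empty 2nd field
csv_text = strrep (str, "\t", ",");    % "1176-0",,"d",4

C = textscan (csv_text, "%q %f %q %f", "EmptyValue", NaN, ...
              "Delimiter", ",", "ReturnOnError", true);
disp (C);
```

This changes only the text handed to textscan; the file on disk stays as it is.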

HTH,
Kai

[1]: https://hg.savannah.gnu.org/hgweb/octave/file/e54e13ee99ce/libinterp/corefcn/file-io.cc#l1619

Re: Textscan Stops for Missing Data

PhilipNienhuis
In reply to this post by gciriani
gciriani wrote
> I have attached sample exlong.tab of about 110 lines (after the header). The code I had outlined reads correctly the first 30 lines; at the 31st line it reads the data up to the column with the missing datum (column viewtime), and it stops.
>
> The MatLab reference shows under section Specify Delimiter and Empty Value Conversion that it should be possible to keep going.
>
> P.S.
> For the example I just included the format string needs to be the following:
> "%q %q %f %q %f %q %f %f %q %f %q %f %q %s %f %q"

Another option would have been csv2cell() in the io package. It can read delimited data files with mixed text/numeric data.
But when I tried it, it turned out that it couldn't read your file either, while AFAICS it should be able to (nothing wrong with the file).
I'll look into that.

Philip

Re: Textscan Stops for Missing Data

gciriani
PhilipNienhuis wrote
> ... when I tried it turned out it couldn't read your file either, while AFAICS it should be able to (nothing wrong with the file)
> I'll look into that.

Thank you! For the time being I can work around the problem by pre-processing my data in Excel (i.e. replacing blank cells with a recognizable number, such as 999). But it would be great to be able to deal with it directly in Octave in the future.
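
For reference, the same sentinel trick can be sketched in Octave with regexprep instead of Excel. The row below is made up for illustration. The sketch only fills fields that sit between two consecutive tabs, so an empty first or last field is left alone, and a blank text column would also receive the sentinel:

```octave
% Sketch only: put a sentinel into every empty field between two tabs.
% The lookahead (?=\t) matches a tab that is directly followed by another
% tab without consuming the second one, so runs of empty fields all get filled.
sentinel = "999";                                   % value taken from the post
row = "\"1176-0\"\t\t\"d\"\t\t\t4";                 % hypothetical sample row
filled = regexprep (row, '\t(?=\t)', ["\t" sentinel]);

disp (strrep (filled, "\t", "<TAB>"));              % make the tabs visible
```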
Giovanni Ciriani - Windows 10, Octave 4.2.1, configured for x86_64-w64-mingw32

Re: Textscan Stops for Missing Data

Ozzy Lash
Have you tried setting the "WhiteSpace" option so that it doesn't include \t?  I see different behavior with that.  The only problem is that your file has a few lines that start with a tab, and that throws the processing off.  If I get rid of those tabs, it reads 110 entries.

The argument I used was "WhiteSpace", " \b\n\r"
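
Spelled out as a call, that could look like the sketch below. It uses the short test string from earlier in the thread rather than the real file, and the result is only displayed, since the outcome may differ between Octave versions:

```octave
% Sketch only: keep tab as the delimiter but leave it out of "WhiteSpace".
str = "\"1176-0\"\t\t\"d\"\t4\n";       % row with an empty second field

C = textscan (str, "%q %f %q %f", ...
              "Delimiter", "\t", ...
              "WhiteSpace", " \b\n\r", ...   % default set minus the tab
              "EmptyValue", NaN);
disp (C);
```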

Reading the documentation for strread, I see that the default is " \b\n\r\t". That documentation also says:

          Whitespace is
          always added to the set of delimiter characters unless at
          least one "%s" format conversion specifier is supplied; in
          that case only whitespace explicitly specified in "delimiter"
          is retained as delimiter and removed from the set of
          whitespace characters.  If whitespace characters are to be
          kept as-is (in e.g., strings), specify an empty value (i.e.,
          "") for "whitespace"; obviously, whitespace cannot be a
          delimiter then.

This is a bit confusing to me, but it sounds as if, with a %s conversion, the \t might be removed from the whitespace characters.  It might make sense to always do that.

Bill


Re: Textscan Stops for Missing Data

PhilipNienhuis
Ozzy:

Ozzy Lash wrote
> Have you tried setting the "WhiteSpace" option so that it doesn't include
> \t?  I see different behavior with that.  The only problem is that your file
> has a few lines that start with a tab, and that throws the processing off.
> If I get rid of those tabs, it reads 110 entries.
>
> The argument I used was "WhiteSpace", " \b\n\r"
>
> Reading the documentation for strread, I see that the default is " \b\n\r\t".
> That documentation also says:
>
>           Whitespace is
>           always added to the set of delimiter characters unless at
>           least one "%s" format conversion specifier is supplied; in
>           that case only whitespace explicitly specified in "delimiter"
>           is retained as delimiter and removed from the set of
>           whitespace characters.  If whitespace characters are to be
>           kept as-is (in e.g., strings), specify an empty value (i.e.,
>           "") for "whitespace"; obviously, whitespace cannot be a
>           delimiter then.
>
> Which is a bit confusing to me, but it sounds like, if there were a %s
> conversion, the \t might get deleted from the whitespace values.  It seems
> like it might make sense to always do that.

Sorry but it looks like you're a bit confused?
strread.m isn't the engine behind textscan anymore.
Nowadays textscan is a separate, binary function and as such much faster than the combo textscan.m+strread.m used to be.
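
A quick way to see which variant a given installation has is to ask Octave what kind of object textscan is. A small sketch, assuming the return codes described in the documentation of exist (2 for a file such as an m-file, 3 for an oct- or mex-file, 5 for a built-in function):

```octave
% Sketch only: report whether textscan is built in or comes from a file.
kind = exist ("textscan");
switch (kind)
  case 5
    disp ("textscan is a built-in function");
  case {2, 3}
    disp ("textscan is provided by a file:");
    disp (which ("textscan"));
  otherwise
    printf ("unexpected result from exist: %d\n", kind);
endswitch
```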

Philip

Re: Textscan Stops for Missing Data

Ozzy Lash


On Sat, Aug 26, 2017 at 10:49 AM, PhilipNienhuis <[hidden email]> wrote:
Ozzy:


Ozzy Lash wrote
> Have you tried setting the "WhiteSpace" option so that it doesn't include
> \t?  I see different behavior with that.  The only problem is that your file
> has a few lines that start with a tab, and that throws the processing off.
> If I get rid of those tabs, it reads 110 entries.
>
> The argument I used was "WhiteSpace", " \b\n\r"
>
> Reading the documentation for strread, I see that the default is " \b\n\r\t".

Sorry but it looks like you're a bit confused?
strread.m isn't the engine behind textscan anymore.
Nowadays textscan is a separate, binary function and as such much faster
than the combo textscan.m+strread.m used to be.

Philip




Oops, it looks like you are correct.  I originally tried the example on an old version, looked at the m-file for textscan, and saw that it used strread.  The example didn't work there, so I switched to a machine with a development version, where the textscan documentation describes the "Whitespace" option.

My suggestion still stands: changing the value of "Whitespace" gives different behavior, and if the lines that begin with a tab are trimmed so that they no longer start with one, I think the behavior is what the original poster wanted. The documentation for "Whitespace" in textscan is clearer to me:

"Whitespace"
          Any character in VALUE will be interpreted as whitespace and
          trimmed; The default value for whitespace is " \b\r\n\t" (note
          the space).  Unless whitespace is set to "" (empty) AND at
          least one "%s" format conversion specifier is supplied, a
          space is always part of whitespace.

I wonder if it would make sense to remove the "Delimiter" value from the "Whitespace" values.
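
Until something like that exists in textscan itself, the idea can be approximated from user code. A rough sketch, assuming the default whitespace set quoted above; the variable names are made up for illustration:

```octave
% Sketch only: derive a "Whitespace" value that excludes the delimiter(s).
default_ws = " \b\r\n\t";     % default quoted from the textscan documentation
delimiter  = "\t";            % may hold several characters, e.g. ",\t"

% Keep only those whitespace characters that are not delimiters.
ws = default_ws(! ismember (default_ws, delimiter));

printf ("%d whitespace characters left, tab still included: %d\n", ...
        numel (ws), any (ws == "\t"));
```

The resulting ws would then be passed to textscan as the value of "Whitespace", next to "Delimiter", delimiter.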

Bill


Re: Textscan Stops for Missing Data

PhilipNienhuis
Ozzy Lash wrote on 27-Aug-17 00:09:

<snip>

> My suggestion still stands, changing the value of "Whitespace" gives
> different behavior, and if the lines that have a tab as the beginning
> character are trimmed so that they don't start with a tab, I think the
> behavior is what the original poster wanted. The documentation for
> "Whitespace" in textscan is more clear to me:
>
> "Whitespace"
>           Any character in VALUE will be interpreted as whitespace and
>           trimmed; The default value for whitespace is " \b\r\n\t" (note
>           the space).  Unless whitespace is set to "" (empty) AND at
>           least one "%s" format conversion specifier is supplied, a
>           space is always part of whitespace.
>
> I wonder if it would make sense to remove the "Delimiter" value from the
> "Whitespace" values.

It could be that Octave's textscan's default whitespace chars don't
match those of Matlab.
Anyway I think this is a bug in textscan.

Philip



Re: Textscan Stops for Missing Data

PhilipNienhuis
In reply to this post by gciriani
gciriani wrote
>
> PhilipNienhuis wrote
>> ... when I tried it turned out it couldn't read your file either, while
>> AFAICS it should be able to (nothing wrong with the file)
>> I'll look into that.
> Thank you! For the time being I'm able to circumvent the problem by
> pre-processing my data in Excel (i.e. adding a recognizable number, such
> as 999, to occurrences of blank cells). But it would be great to be able
> to deal with it directly in Octave in the future.

Hi,

(coming back to this)

I looked into your issue because I feared a bug in csv2cell(). Luckily that
works fine AFAICS.

However it appears that your tab-delimited text file is a bit messy:

Lines 3, 4 and 6 contain a leading TAB character. That should be no problem
per se, but as it is an extra delimiter (unintended or not) textscan gets
out-of-sync as it assumes a data field before the leading TAB.
Lines 1, 4, and 6 to 14 also have a *trailing* tab. That confuses textscan()
even more.

In fact textscan (and csv2cell) see 17 rather than 16 data columns in lines
1 and 7-14 (the 17th behind the trailing TAB characters, but empty), and
even 18 (the first before the leading TAB) in lines 4 and 6.
All that makes it hard for a picky program like textscan to do as you
instructed :-)
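
A check like this can also be scripted. The sketch below counts tab-separated fields per line and flags leading and trailing tabs; it runs on a small made-up sample rather than on exlong.tab, and it counts every tab, including any that might sit inside a quoted field:

```octave
% Sketch only: per-line field count plus leading/trailing tab flags.
txt = "a\tb\n\tc\td\ne\tf\t\n";                % hypothetical 3-line sample

lines = regexp (txt, '\r?\n', "split");        % split into lines
lines = lines(! cellfun (@isempty, lines));    % drop empty lines

nfields  = cellfun (@(l) sum (l == "\t") + 1, lines);   % tabs + 1 = fields
leading  = cellfun (@(l) l(1) == "\t", lines);          % line starts with tab
trailing = cellfun (@(l) l(end) == "\t", lines);        % line ends with tab

for k = 1:numel (lines)
  printf ("line %d: %d fields, leading tab: %d, trailing tab: %d\n", ...
          k, nfields(k), leading(k), trailing(k));
endfor
```

For a real file, txt could be loaded with fileread; lines whose field count differs from the number of format specifiers are the ones to look at.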

I suppose you could instruct textscan to skip trailing stuff by adding a
"%*[^\n]" format specifier after the last one, telling it to skip any
characters after the last field ("column") you want to read until EOL.
Still, you'll see that it gets out-of-sync on lines 3, 4 and 6.

So I'd really suggest cleaning up the program that created the exlong.tab
file in the first place; it could save you quite a bit of trouble :-)
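
If the generating program cannot be changed, the stray tabs could instead be stripped in Octave before parsing. A rough sketch on made-up lines, assuming the expected column count is known and that a leading or trailing tab on an over-long line is unintended rather than a genuinely empty first or last field:

```octave
% Sketch only: strip a leading/trailing tab from lines with too many fields.
expected = 2;                                   % hypothetical column count
lines = {"a\tb", "\tc\td", "e\tf\t"};           % hypothetical sample lines

clean = lines;
for k = 1:numel (lines)
  l = lines{k};
  if (sum (l == "\t") + 1 > expected)           % more fields than expected
    l = regexprep (l, '^\t|\t$', "");           % drop leading/trailing tab
  endif
  clean{k} = l;
endfor

cleaned_text = strjoin (clean, "\n");           % text to hand to textscan
disp (strrep (cleaned_text, "\t", "<TAB>"));
```

Lines that already have the expected number of fields are left untouched, so an intentionally empty last column survives.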

BTW a hint:
I used notepad++ to read exlong.tab and made it show White Space and Tab in
View | Show Symbols

Philip




--
Sent from: http://octave.1599824.n4.nabble.com/Octave-General-f1599825.html


Re: Textscan Stops for Missing Data

gciriani
Thank you for taking the time and for finding out that there were extra tabs. I
created the text example starting from the original file, and I might
have added extra characters. I will check the original file and report
back. Good idea to use Notepad++!
PhilipNienhuis wrote
> ... Lines 1, 4, and 6 to 14 also have a *trailing* tab. That confuses
> textscan() even more. ...





-----
Giovanni Ciriani - Windows 10, Octave 4.2.1, configured for x86_64-w64-mingw32