How to load data files with mixed str & numerical data with headers "*.txt.gz" or "*.xml.tgz"

Shoumei
I am new to Octave and am trying to process large data files. Until now I have unzipped them, opened them in Excel, counted the columns and rows, and then loaded them with "textread", which means defining the format ("%s" or "%f") for every single column. This gets slower and harder with a file of e.g. 49455x280. I also tried to load the files directly into Octave, but I got the error "inconsistent number of columns near line 2". Suggestions are anxiously needed.
I have Octave 3.6.2 with the io, java and image packages installed.
Thanks
Shoumei

Re: How to load data files with mixed str & numerical data with headers "*.txt.gz" or "*.xml.tgz"

Philip Nienhuis
Shoumei wrote:
> I am new to Octave and am trying to process large data files. Until now I have unzipped them, opened them in Excel, counted the columns and rows, and then loaded them with "textread", which means defining the format ("%s" or "%f") for every single column. This gets slower and harder with a file of e.g. 49455x280. I also tried to load the files directly into Octave, but I got the error "inconsistent number of columns near line 2". Suggestions are anxiously needed.
> I have Octave 3.6.2 with the io, java and image packages installed.

Sorry, I don't fully understand.

What format are your data files (after unzipping)?
- Excel .xls or .xlsx files?
- plain text files with numeric and text columns?

If they are Excel files, you can obviously read them directly. But I suppose you have text files.
Once you have got them imported into Excel anyway, why don't you simply:
- save them from Excel into .xls and use xlsread or xlsopen-xls2oct-parsecell-xlsclose, or:
- save them from Excel into .csv and use csv2cell (optionally followed by parsecell to separate the numerical and text data).
Or are they too big? 50,000 X 300 is easily loaded into e.g., recent Excel and LibreOffice versions (capacity 10^6 rows by 1024 columns).

You can also try dlmread, if the file contents are simple and you don't care for the text contents.
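
As a rough illustration of the dlmread route, here is a minimal sketch. It assumes a tab-delimited file with one header line and one leading text column; the file name and contents are made up for the example and are not taken from the poster's data.

```octave
% Write a small tab-delimited sample file (hypothetical contents).
fname = tempname ();
fid = fopen (fname, "w");
fprintf (fid, "ID\tGSM1\tGSM2\n");
fprintf (fid, "probe_a\t1.5\t2.5\n");
fprintf (fid, "probe_b\t3.0\t4.0\n");
fclose (fid);

% Skip 1 header row and 1 text column (both offsets are zero-based),
% so that only the numeric block is read.
data = dlmread (fname, "\t", 1, 1);

delete (fname);
disp (data)
```

The text column is simply skipped here, so this only helps when the IDs are not needed or are read separately.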

Anyway, textread is a slow and cumbersome way to read simple data files with many columns. Its main use is that it does come in handy if you need to read complicated and ugly text files.

BTW, somewhere at work I must have a quick-and-dirty Matlab/Octave function that "explores" text files of the kind you probably refer to: simple, with many columns. It returns a format string for use in strread/textread/textscan (optionally after skipping a user-defined number of header lines; I may have automated that as well, I don't remember). It was meant to further enhance textscan/strread/textread and simplify their use, but that mission somehow got swamped.
If you really want I can try to dig it up, but it can take until next week or later before I have an opportunity to search for it.
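
A minimal sketch of what such a format-guessing helper could look like (this is not Philip's actual script). The function name guess_format, its arguments, and the sample file are hypothetical. It assumes that the first data line is representative of all rows and that the delimiter can be given as a regular expression.

```octave
1;  % script marker, so the function below can be defined in the same file

function fmt = guess_format (fname, nheader, delim)
  % Hypothetical helper: build a format string for textscan/textread
  % by inspecting the first data line of a delimited text file.
  fid = fopen (fname, "r");
  if (fid < 0)
    error ("guess_format: cannot open %s", fname);
  end
  for k = 1:nheader
    fgetl (fid);                        % discard header lines
  end
  ln = fgetl (fid);                     % first data line
  fclose (fid);
  if (~ischar (ln))
    error ("guess_format: no data line found after the header");
  end
  fields = regexp (ln, delim, "split"); % delim is treated as a regexp
  isnum = ~isnan (str2double (fields)); % fields that parse as numbers
  spec = repmat ({"%s"}, size (fields));
  spec(isnum) = {"%f"};
  fmt = strtrim (sprintf ("%s ", spec{:}));
end

% Usage on a small made-up sample file.
fname = tempname ();
fid = fopen (fname, "w");
fprintf (fid, "ID\tGSM1\tGSM2\n");
fprintf (fid, "probe_a\t1.5\t2.5\n");
fclose (fid);

fmt = guess_format (fname, 1, "\t");
delete (fname);
disp (fmt)
```

The resulting string can then be passed to textscan together with its "HeaderLines" and "Delimiter" options. Fields such as "NA" or an empty cell in the first data line would be classified as text, so a more careful version might inspect several lines.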

Philip

Re: How to load data files with mixed str & numerical data with headers "*.txt.gz" or "*.xml.tgz"

Shoumei
GSE1-2.txt
The extracted data files are text files with the suffix ".soft". I need to extract the data matrices from them: usually repeated 44Kxn tables, each starting with the line "!series_matrix_table_begin" and ending with "!series_matrix_table_end". The string data are 44Kx1 and the numerical data are 44Kx(n-1). I could possibly ignore the other "!" comment lines. I know the exact size of each repeated matrix.
The attached example file includes data for two samples (GSM1 and GSM2), each with a 2x2 matrix.
The complete txt file could not be loaded in Excel or Word.
The other tables in the sample file, between "!platform_table_begin" and "!platform_table_end", hold info related to each ID/string. I can process this info from other files, so it is ignored for the time being.
When I dealt with small txt files, I usually opened them in Excel and saved the strings (IDs) as csv files and the numerical data as txt files. I had trouble comparing the strings with other string files when I saved the whole file as csv and used "csv2cell" to load the data.
Initially I was not able to use "xlsread" with mixed string and numerical xls data, so I did not persist in using it.
Thanks
Shoumei

Re: How to load data files with mixed str & numerical data with headers "*.txt.gz" or "*.xml.tgz"

Philip Nienhuis
Shoumei wrote:
> GSE1-2.txt
> The extracted data files are text files with the suffix ".soft". I need to extract the data matrices from them: usually repeated 44Kxn tables, each starting with the line "!series_matrix_table_begin" and ending with "!series_matrix_table_end". The string data are 44Kx1 and the numerical data are 44Kx(n-1). I could possibly ignore the other "!" comment lines. I know the exact size of each repeated matrix.
> The attached example file includes data for two samples (GSM1 and GSM2), each with a 2x2 matrix.

That file format looks easy to read. Dataloggers we use at work yield more or less the same file structure (header followed by data sections between "something-like-begin-data" and "something-like-end-data" lines) and we have several Matlab scripts for reading those into a struct.
The file header is parsed line by line into separate fields, the data sections into dedicated data fields (often numeric arrays, sometimes cell arrays).
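
The approach described above could be sketched roughly as follows. This is not one of the scripts mentioned; the function name read_sections, the struct field names, and the sample file contents are hypothetical. It assumes tab-delimited rows whose first column is a text ID and whose remaining columns are numeric, and it ignores every line outside the begin/end markers.

```octave
1;  % script marker, so the function below can be defined in the same file

function tables = read_sections (fname, begin_tag, end_tag, delim)
  % Hypothetical reader: every block of lines between begin_tag and
  % end_tag becomes one element of a struct array with the fields
  % "ids" (cell array of strings) and "values" (numeric matrix).
  tables = struct ("ids", {}, "values", {});
  fid = fopen (fname, "r");
  if (fid < 0)
    error ("read_sections: cannot open %s", fname);
  end
  inside = false;
  ids = {};
  rows = {};
  ln = fgetl (fid);
  while (ischar (ln))
    tag = strtrim (ln);                 % tolerate spaces around markers
    if (strcmp (tag, begin_tag))
      inside = true;
      ids = {};
      rows = {};
    elseif (strcmp (tag, end_tag))
      if (inside)
        % vertcat fails if the rows differ in length
        tables(end+1) = struct ("ids", {ids}, "values", vertcat (rows{:}));
      end
      inside = false;
    elseif (inside)
      fields = regexp (ln, delim, "split");
      ids{end+1, 1} = fields{1};
      rows{end+1, 1} = str2double (fields(2:end));
    end
    ln = fgetl (fid);
  end
  fclose (fid);
end

% Usage on a small made-up file with two 2x2 sections.
fname = tempname ();
fid = fopen (fname, "w");
fprintf (fid, "!some_comment = ignored\n");
fprintf (fid, "!series_matrix_table_begin\nprobe_a\t1.5\nprobe_b\t3.0\n");
fprintf (fid, "!series_matrix_table_end\n");
fprintf (fid, "!series_matrix_table_begin\nprobe_a\t2.5\nprobe_b\t4.0\n");
fprintf (fid, "!series_matrix_table_end\n");
fclose (fid);

t = read_sections (fname, "!series_matrix_table_begin", ...
                   "!series_matrix_table_end", "\t");
delete (fname);
disp (numel (t))
```

A column-header line inside a section would come out as a row of NaN values in this sketch and would have to be dropped by the caller. For 44K-row tables, growing the cell arrays line by line may be slow; since the table sizes are known in advance, preallocating would be a sensible refinement.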

> The complete txt file could not be loaded in Excel or Word.

What MS-Office version did you use?
As I wrote, LibreOffice Calc 3.4 and later versions should be able to read very very big files. But your computer RAM may be a limiting factor.

> The other tables in the sample file, between "!platform_table_begin" and "!platform_table_end", hold info related to each ID/string. I can process this info from other files, so it is ignored for the time being.
> When I dealt with small txt files, I usually opened them in Excel and saved the strings (IDs) as csv files and the numerical data as txt files. I had trouble comparing the strings with other string files when I saved the whole file as csv and used "csv2cell" to load the data.

You should save only the individual data sections as .csv and read them with csv2cell. Concatenating the sections can be done into a struct or so.

> Initially I was not able to use "xlsread" with mixed string and numerical xls data, so I did not persist in using it.

Remarkable... spreadsheets are made for just that. But they are usually fairly inefficient as far as RAM usage is concerned (because of, among other things, the formatting overhead).

Philip

Re: How to load data files with mixed str & numerical data with headers "*.txt.gz" or "*.xml.tgz"

Shoumei
Thanks for the comments and recommendations. It would be great if you could share some of the scripts for reading this type of file.
You are right that my computer's RAM is a limiting factor. The Excel I am using is Office 2010. When I tried to open the files, it said something like the number of rows/columns exceeds the capacity of Excel (~1000Kx10K? I forgot the exact error message).

Shoumei