Standard example datasets

apjanke-floss
Hi, Octave maintainers,

Some other statistical programs ship with standard example datasets and
methods to load or explore them. Does Octave have something like this?

For example, R ships with a bunch of example datasets in its "datasets"
package, and you can view a list of them by doing `data()`. And Matlab
ships with a bazillion example datasets that seem to all be just MAT
files in its source code root directories, that you can access with
load, like `load patients`.

Use case: I'm working on table stuff, and would like to add some example
tabular datasets in my package. Wondering if there's a standard
mechanism I should integrate with.

Cheers,
Andrew

Re: Standard example datasets

siko1056
On Sat, Apr 27, 2019 at 10:02 PM Andrew Janke <[hidden email]> wrote:
> Hi, Octave maintainers,
>
> Some other statistical programs ship with standard example datasets and
> methods to load or explore them. Does Octave have something like this?
>
> For example, R ships with a bunch of example datasets in its "datasets"
> package, and you can view a list of them by doing `data()`. And Matlab
> ships with a bazillion example datasets that seem to all be just MAT
> files in its source code root directories, that you can access with
> load, like `load patients`.
>
> Use case: I'm working on table stuff, and would like to add some example
> tabular datasets in my package. Wondering if there's a standard
> mechanism I should integrate with.
>
> Cheers,
> Andrew


All that Octave currently ships seems to be penny.mat [1], but I am afraid that this does not help with tabular datasets.  If you have a small data set, it might be possible to extend Octave's "sparse" data collection.

Best,
Kai

Re: Standard example datasets

Carnë Draug
In reply to this post by apjanke-floss
On Sat, 27 Apr 2019 at 21:02, Andrew Janke <[hidden email]> wrote:

>
> Hi, Octave maintainers,
>
> Some other statistical programs ship with standard example datasets and
> methods to load or explore them. Does Octave have something like this?
>
> For example, R ships with a bunch of example datasets in its "datasets"
> package, and you can view a list of them by doing `data()`. And Matlab
> ships with a bazillion example datasets that seem to all be just MAT
> files in its source code root directories, that you can access with
> load, like `load patients`.
>
> Use case: I'm working on table stuff, and would like to add some example
> tabular datasets in my package. Wondering if there's a standard
> mechanism I should integrate with.
>

Matlab also comes with such datasets.  Ideally we would have the same
so that examples that use them work in Octave as well.  It would also
simplify some test cases which require generation of input data (I
would argue that it would actually enable them, because if generation of
such complex datasets is too complicated then there are no tests for
them).

Anyway, there is already an item on the tracker [1] that lists the
ones in Matlab.  The issue is finding out who the copyright holder of
such data is and contacting them.

[1] https://savannah.gnu.org/patch/?9544

Re: Standard example datasets

apjanke-floss

On 4/28/19 8:27 AM, Carnë Draug wrote:

> On Sat, 27 Apr 2019 at 21:02, Andrew Janke <[hidden email]> wrote:
>>
>> Hi, Octave maintainers,
>>
>> Some other statistical programs ship with standard example datasets and
>> methods to load or explore them. Does Octave have something like this?
>>
>> For example, R ships with a bunch of example datasets in its "datasets"
>> package, and you can view a list of them by doing `data()`. And Matlab
>> ships with a bazillion example datasets that seem to all be just MAT
>> files in its source code root directories, that you can access with
>> load, like `load patients`.
>>
>> Use case: I'm working on table stuff, and would like to add some example
>> tabular datasets in my package. Wondering if there's a standard
>> mechanism I should integrate with.
>>
>
> Matlab also comes with such datasets.  Ideally we would have the same
> so that examples that use them work in Octave as well.  It would also
> simplify some test cases which require generation of input data (I
> would argue that it would actually enable them, because if generation of
> such complex datasets is too complicated then there are no tests for
> them).
>
> Anyway, there is already an item on the tracker [1] that lists the
> ones in Matlab.  The issue is finding out who the copyright holder of
> such data is and contacting them.
>
> [1] https://savannah.gnu.org/patch/?9544
>

Do we have any lawyers or software licensing experts on the list?

My understanding is that simple databases are not subject to copyright,
under the "you can't copyright facts" principle. They're just subject to
whatever licensing terms you signed a contract to get access to the data
under.

I'm looking through the R source code. R's example datasets are mostly
little datasets written out in source code like this:

"VADeaths" <-
structure(c(11.7, 18.1, 26.9, 41, 66, 8.7, 11.7, 20.3, 30.9, 54.3, 15.4,
24.3, 37, 54.6, 71.1, 8.4, 13.6, 19.3, 35.1, 50), .Dim = c(5, 4),
.Dimnames = list(c("50-54", "55-59", "60-64", "65-69", "70-74"),
c("Rural Male", "Rural Female", "Urban Male", "Urban Female")))


Could we just take the numbers from the R code, either under the "no
copyright for dbs" rule, or under the same license that R itself is
distributed under, rewrite it as M-code, and include those?
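
A minimal sketch of what such a translation might look like in M-code, using only the values from the R snippet above. The struct layout and the field names (`values`, `rownames`, `colnames`) are assumptions chosen for illustration, not an existing Octave convention. R fills a matrix column by column, as Octave's `reshape` does, so the numbers can be copied in the same order.

```octave
% Values copied from the R snippet above; both R and Octave fill
% matrices in column-major order, so the order is unchanged.
values = reshape ([11.7, 18.1, 26.9, 41, 66, 8.7, 11.7, 20.3, 30.9, 54.3, ...
                   15.4, 24.3, 37, 54.6, 71.1, 8.4, 13.6, 19.3, 35.1, 50], 5, 4);

% Hypothetical container: a plain struct with illustrative field names.
VADeaths = struct ();
VADeaths.values   = values;
VADeaths.rownames = {"50-54", "55-59", "60-64", "65-69", "70-74"};
VADeaths.colnames = {"Rural Male", "Rural Female", "Urban Male", "Urban Female"};

% Look up one cell by its row and column labels.
r = strcmp (VADeaths.rownames, "60-64");
c = strcmp (VADeaths.colnames, "Urban Male");
printf ("%s, %s: %.1f\n", "60-64", "Urban Male", VADeaths.values(r, c));
```

A table class would carry the labels more naturally than a bare struct; the struct is used here only to keep the sketch free of dependencies.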

Cheers,
Andrew

Re: Standard example datasets

Alois Schloegl-7
In reply to this post by Carnë Draug
On 28.04.19 14:27, Carnë Draug wrote:

> On Sat, 27 Apr 2019 at 21:02, Andrew Janke <[hidden email]> wrote:
>>
>> Hi, Octave maintainers,
>>
>> Some other statistical programs ship with standard example datasets and
>> methods to load or explore them. Does Octave have something like this?
>>
>> For example, R ships with a bunch of example datasets in its "datasets"
>> package, and you can view a list of them by doing `data()`. And Matlab
>> ships with a bazillion example datasets that seem to all be just MAT
>> files in its source code root directories, that you can access with
>> load, like `load patients`.
>>
>> Use case: I'm working on table stuff, and would like to add some example
>> tabular datasets in my package. Wondering if there's a standard
>> mechanism I should integrate with.
>>
>
> Matlab also comes with such datasets.  Ideally we would have the same
> so that examples that use them work in Octave as well.  It would also
> simplify some test cases which require generation of input data (I
> would argue that it would actually enable them, because if generation of
> such complex datasets is too complicated then there are no tests for
> them).
>
> Anyway, there is already an item on the tracker [1] that lists the
> ones in Matlab.  The issue is finding out who the copyright holder of
> such data is and contacting them.
>
> [1] https://savannah.gnu.org/patch/?9544
>


Hi,


please consider also load_fisheriris (part of NaN-toolbox), which
downloads the data from [1] and caches the files in the current working
directory. This approach requires a network connection at runtime, when
the data is first downloaded. Perhaps a similar approach would be
suitable for other data sets.

If licensing of data is an issue, such download and cache mechanism
might be a viable solution. And the function "load" could provide a
functionality such that
    load fisheriris
would work out of the box.
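
A download-and-cache loader of this kind could be sketched roughly as follows. This is not the NaN-toolbox's `load_fisheriris` implementation: `fetch_cached` and its optional `downloader` argument are hypothetical names, and the file name `iris.data` in the commented usage line is an assumption about what sits under [1]. The only built-in relied on for the download is Octave's `urlwrite (url, localfile)`.

```octave
1;  % script-file marker so the function below can be defined inline

function path = fetch_cached (url, cache_file, downloader)
  % Download URL to CACHE_FILE unless a cached copy already exists.
  % DOWNLOADER is a hypothetical hook taking (url, file); it defaults
  % to urlwrite and exists mainly so the logic can be tried offline.
  if (nargin < 3)
    downloader = @urlwrite;
  endif
  if (! exist (cache_file, "file"))
    downloader(url, cache_file);   % network access happens only here
  endif
  path = cache_file;
endfunction

% Usage (commented out because it needs a network connection; the
% file name is an assumption):
% f = fetch_cached ("http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data", "iris.data");
```

The first call pays the download cost; later calls find the cached file and never touch the network, which is the behaviour described above.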

BTW, the site [1] contains a number of other data sets, that octave
might want to support.

[1] http://archive.ics.uci.edu/ml/machine-learning-databases/iris/


Cheers,
  Alois



Re: Standard example datasets

Carnë Draug
In reply to this post by apjanke-floss
On Thu, 2 May 2019 at 01:22, Andrew Janke <[hidden email]> wrote:

>
>
> On 4/28/19 8:27 AM, Carnë Draug wrote:
> > On Sat, 27 Apr 2019 at 21:02, Andrew Janke <[hidden email]> wrote:
> >>
> >> Hi, Octave maintainers,
> >>
> >> Some other statistical programs ship with standard example datasets and
> >> methods to load or explore them. Does Octave have something like this?
> >>
> >> For example, R ships with a bunch of example datasets in its "datasets"
> >> package, and you can view a list of them by doing `data()`. And Matlab
> >> ships with a bazillion example datasets that seem to all be just MAT
> >> files in its source code root directories, that you can access with
> >> load, like `load patients`.
> >>
> >> Use case: I'm working on table stuff, and would like to add some example
> >> tabular datasets in my package. Wondering if there's a standard
> >> mechanism I should integrate with.
> >>
> >
> > Matlab also comes with such datasets.  Ideally we would have the same
> > so that examples that use them work in Octave as well.  It would also
> > simplify some test cases which require generation of input data (I
> > would argue that it would actually enable them, because if generation of
> > such complex datasets is too complicated then there are no tests for
> > them).
> >
> > Anyway, there is already an item on the tracker [1] that lists the
> > ones in Matlab.  The issue is finding out who the copyright holder of
> > such data is and contacting them.
> >
> > [1] https://savannah.gnu.org/patch/?9544
> >
>
> Do we have any lawyers or software licensing experts on the list?
>
> My understanding is that simple databases are not subject to copyright,
> under the "you can't copyright facts" principle. They're just subject to
> whatever licensing terms you signed a contract to get access to the data
> under.

None of us are lawyers.  Some people will argue that datasets are
copyrightable.  There's a bunch of scientists struggling with the
whole thing about sharing data, and licenses for data are a real
thing.  Also, some of those datasets are images and photographs,
including photographs of paintings.

I think discussing this is outside the scope of Octave.

> I'm looking through the R source code. R's example datasets are mostly
> little datasets written out in source code like this:
>
> [...]
>
> Could we just take the numbers from the R code, either under the "no
> copyright for dbs" rule, or under the same license that R itself is
> distributed under, rewrite it as M-code, and include those?

The whole point I tried to make before was that it would be more
useful to have the same datasets as Matlab because that makes it easier
to copy-paste examples.  If you copy the datasets of R, then you can no
longer copy-paste such example code into Octave, at which point you
might as well make up your own datasets and sidestep the whole
copyright question.

Re: Standard example datasets

apjanke-floss


On 5/2/19 10:38 AM, Carnë Draug wrote:

> On Thu, 2 May 2019 at 01:22, Andrew Janke <[hidden email]> wrote:
>>
>>
>> On 4/28/19 8:27 AM, Carnë Draug wrote:
>>> On Sat, 27 Apr 2019 at 21:02, Andrew Janke <[hidden email]> wrote:
>>>>
>>>
>>> Anyway, there is already an item on the tracker [1] that lists the
>>> ones in Matlab.  The issue is finding out who the copyright holder of
>>> such data is and contacting them.
>>>
>>> [1] https://savannah.gnu.org/patch/?9544
>>>
>>
>> Do we have any lawyers or software licensing experts on the list?
>>
>> My understanding is that simple databases are not subject to copyright,
>> under the "you can't copyright facts" principle. They're just subject to
>> whatever licensing terms you signed a contract to get access to the data
>> under.
>
> None of us are lawyers.  Some people will argue that datasets are
> copyrightable.  There's a bunch of scientists struggling with the
> whole thing about sharing data, and licenses for data are a real
> thing.  Also, some of those datasets are images and photographs,
> including photographs of paintings.
>
> I think discussing this is outside the scope of Octave.

Guess I need to do some research on this.

>> I'm looking through the R source code. R's example datasets are mostly
>> little datasets written out in source code like this:
>>
>> [...]
>>
>> Could we just take the numbers from the R code, either under the "no
>> copyright for dbs" rule, or under the same license that R itself is
>> distributed under, rewrite it as M-code, and include those?
>
> The whole point I tried to make before was that it would be more
> useful to have the same datasets as Matlab because that makes it easier
> to copy-paste examples.  If you copy the datasets of R, then you can no
> longer copy-paste such example code into Octave, at which point you
> might as well make up your own datasets and sidestep the whole
> copyright question.

I get and agree with your other point: having full compatibility with
the Matlab example datasets, to the point where you can
copy-and-paste the Matlab code using their example data sets, would be
excellent.

I don't think I can contribute to that, though: the Matlab example data
sets don't have public documentation; the only way you can see how
they're structured is by examining them in Matlab. By my reading, that's
a violation of the Matlab license's Non-Compete clause, which prohibits
using Matlab to develop any competing product, copyright aside. So I'm
not going to touch that; y'all can make your own decisions.

Another thing here is that the way Matlab organizes its example datasets
is lousy IMHO: they're just a bunch of mat-files dumped in the path. No
interface to list all the examples, get metadata about them, load them
through a uniform interface besides the plain "load(filename)", keep
them out of the global identifier namespace, etc. I like R's
organization of their example datasets into a "datasets" package with a
metadata lister.
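
To make the idea of a uniform interface concrete, here is a minimal sketch of a dataset registry with a lister and a loader. All names (`example_registry`, `print_datasets`, `load_dataset`) are hypothetical and both datasets are made up for the illustration; this is not the interface of any existing package. The loader returns the data as a value instead of injecting variables into the caller's workspace, which is one way to keep datasets out of the global namespace.

```octave
1;  % script-file marker so the functions below can be defined inline

function reg = example_registry ()
  % Hypothetical registry: one struct element per dataset, each with a
  % name, a description, and a loader that builds the data on demand.
  line_loader    = @() struct ("x", 1:5, "y", 2 * (1:5));
  squares_loader = @() struct ("n", 1:5, "sq", (1:5) .^ 2);
  reg = struct ( ...
    "name",        {"toy_line", "toy_squares"}, ...
    "description", {"Points on the line y = 2x", "First five squares"}, ...
    "loader",      {line_loader, squares_loader});
endfunction

function print_datasets (reg)
  % Metadata lister: one line per registered dataset.
  for i = 1:numel (reg)
    printf ("%-12s %s\n", reg(i).name, reg(i).description);
  endfor
endfunction

function data = load_dataset (reg, name)
  % Uniform loader: look the dataset up by name and return it as a value.
  idx = find (strcmp ({reg.name}, name));
  if (isempty (idx))
    error ("load_dataset: unknown dataset '%s'", name);
  endif
  loader = reg(idx).loader;
  data = loader();
endfunction

reg = example_registry ();
print_datasets (reg);
d = load_dataset (reg, "toy_line");
disp (d.y);
```

A real implementation would probably live in a package namespace and load from files, but the list / describe / load split would stay the same.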

I think it could also be useful to have Octave equivalents of the R
example datasets. Matlab isn't the only substitute for Octave; R and
numeric Python are, too. We might have users coming from R to Octave, or
vice versa. Having the same example datasets in both programs would be
useful for pedagogical purposes, so you can show how to perform the same
analysis on the same data in both languages, providing a sort of Rosetta
stone for them.

I've started working through the R example data sets and translating a
few of them into Octave, and it's proven to be a useful exercise,
exposing some bugs in my Chrono and Tablicious libraries. I'm going to
continue working through them, and if I end up with something useful,
I'll let y'all know. If you want to follow along, it's on this branch on
my GitHub repo:
https://github.com/apjanke/octave-tablicious/tree/example-datasets.


Alois Schloegl wrote:

> ...please consider also load_fisheriris (part of NaN-toolbox) ...
> ...BTW, the site [1] contains a number of other data sets...

Alois Schloegl: I've included the Fisher Iris dataset. Of course. It was
the first one I added. :) My approach includes storing the mat-file
version of the dataset in the source tree, so a network connection is
only needed at development time, not run time. And I'm looking through
that ics.uci.edu site for other potential example data sets.
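
As a rough sketch of the "convert once at development time, load from the source tree at run time" approach: the directory, file name, and variable names below are assumptions, `tempdir` stands in for the package's source tree, and the numbers are placeholders and not real dataset values. Only the built-in `save` and `load` are used.

```octave
% Development time: convert the raw data once and commit the result.
% tempdir stands in for the package's source tree (an assumption).
data_dir = fullfile (tempdir (), "example-datasets-sketch");
if (! exist (data_dir, "dir"))
  mkdir (data_dir);
endif
meas   = [5.1 3.5; 4.9 3.0];   % placeholder values, not a real dataset
labels = {"a"; "b"};           % placeholder labels
mat_file = fullfile (data_dir, "toy.mat");
save ("-v7", mat_file, "meas", "labels");

% Run time: read the bundled file; no network access is involved.
s = load (mat_file);
disp (size (s.meas));
```

Calling `load` with an output argument returns the variables as fields of a struct, so nothing is added to the caller's workspace implicitly.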

Cheers,
Andrew


Re: Standard example datasets

Michael Godfrey
In reply to this post by Carnë Draug
I have in the past had responsibility for computer-related material. And, I
know people who are expert in this field. The short answer to any question
about rights is that no one knows.  No one even agrees what laws apply to
computer codes or data. And there is no known case, as far as I know, of a
successful prosecution, even in the Java case. Archive.org has received
demands that material be removed "immediately" but,... One fact is that if
no monetary gain is involved the case for prosecution is harder to make.
It is hard to argue that making data available that is already freely available
from other sources could be illegal (i.e. using R examples, which I also think
would be a good plan, should be acceptable). It might make sense to ask the
R folks if this is OK with them.

Even organizations which are clearly for profit put online, but charge for,
materials, including data, over which they have no copyright.
Wiley is a prominent example of this.

On Sat, 27 Apr 2019 at 21:02, Andrew Janke <[hidden email]> wrote:
>> Hi, Octave maintainers,
>>
>> Some other statistical programs ship with standard example datasets and
>> methods to load or explore them. Does Octave have something like this?
>>
>> For example, R ships with a bunch of example datasets in its "datasets"
>> package, and you can view a list of them by doing `data()`. And Matlab
>> ships with a bazillion example datasets that seem to all be just MAT
>> files in its source code root directories, that you can access with
>> load, like `load patients`.
>>
>> Use case: I'm working on table stuff, and would like to add some example
>> tabular datasets in my package. Wondering if there's a standard
>> mechanism I should integrate with.
>
> Matlab also comes with such datasets.  Ideally we would have the same
> so that examples that use them work in Octave as well.  It would also
> simplify some test cases which require generation of input data (I
> would argue that it would actually enable them, because if generation of
> such complex datasets is too complicated then there are no tests for
> them).
>
> Anyway, there is already an item on the tracker [1] that lists the
> ones in Matlab.  The issue is finding out who the copyright holder of
> such data is and contacting them.
>
> [1] https://savannah.gnu.org/patch/?9544