Re: [GSoC 2021] How should I do now with project Table datatype

siko1056
On 3/8/21 6:51 PM, 陈栋林 wrote:
> I have seen that you are the potential mentors in the project Table
> datatype. How should I do now with this project for applying gsoc? How
> can I make my first contribution? Thank you

Thank you for your interest in GSoC with Octave.  Yes, I am willing to
mentor a project on creating a Matlab compatible table datatype [1].

I think Google will announce the mentoring organizations tomorrow.  If
Octave is chosen (you are of course always free to work on this project
outside GSoC as well), you can familiarize yourself with the existing
code and improve it (a bit).  An "easy" potential starting point to show
your Octave coding skills is writing some BISTs (built-in self tests)
[2] for "octave-tablicious" [3], for example.  That means creating a
fork of "octave-tablicious" or starting a new Octave package [4] that
copies a subset of that project.
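
For readers new to BISTs, here is a minimal sketch of what such a test block can look like. The function name colsum and its behavior are made up for illustration and are not part of Tablicious; only the "%!" test syntax comes from Octave itself.

```octave
## colsum.m: hypothetical example function with embedded BISTs.
## Lines starting with "%!" are ordinary comments when the function runs;
## "test colsum" extracts and executes them.

function s = colsum (x)
  ## Sum each column of a matrix.
  s = sum (x, 1);
endfunction

%!assert (colsum ([1 2; 3 4]), [4 6])

%!test
%! x = magic (3);
%! assert (colsum (x), [15 15 15]);

%!error colsum ()
```

Saved as colsum.m somewhere on the load path, the tests would be run with "test colsum" at the Octave prompt.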

Two more things if you prefer to communicate via email:
1. Please keep the Octave maintainers mailing-list in the CC and add a
subject prefix "[GSoC]" or "[GSoC 2021]".
2. Please answer below the previous post ("bottom-posting").

Kai

[1] https://wiki.octave.org/Summer_of_Code_-_Getting_Started#Table_datatype
[2] https://wiki.octave.org/Tests
[3] https://github.com/apjanke/octave-tablicious/issues/30
[4] https://github.com/gnu-octave/pkg-example


Re: [GSoC 2021] How should I do now with project Table datatype

apjanke-floss

On 3/8/21 11:13 PM, Kai Torben Ohlhus wrote:

> On 3/8/21 6:51 PM, 陈栋林 wrote:
>> I have seen that you are the potential mentors in the project Table
>> datatype. How should I do now with this project for applying gsoc? How
>> can I make my first contribution? Thank you
>
> Thank you for your interest in GSoC with Octave.  Yes, I am willing to
> mentor a project on creating a Matlab compatible table datatype [1].
>
> I think tomorrow the mentoring organizations will be announced by
> Google.  If Octave is chosen (of course you are always free to work on
> this project outside GSoC as well), you can familiarize yourself with
> the existing codes and improve them (a bit).  An "easy" potential
> starting point to show your Octave coding skills is creating some BIST
> [2] for "octave-tablicious" [3], for example.  That means create a fork
> of "octave-tablicious" or start a new Octave package [4] copying a
> subset of that project.
>
> Two more things if you prefer to communicate via email:
> 1. Please keep the Octave maintainers mailing-list in the CC and add a
> subject prefix "[GSoC]" or "[GSoC 2021]".
> 2. Please answer below the previous post ("bottom-posting").
>
> Kai
>
> [1] https://wiki.octave.org/Summer_of_Code_-_Getting_Started#Table_datatype
> [2] https://wiki.octave.org/Tests
> [3] https://github.com/apjanke/octave-tablicious/issues/30
> [4] https://github.com/gnu-octave/pkg-example
>

Hi, 陈栋林!

I'm Andrew Janke, the author of octave-tablicious. I would be happy to
accept PRs adding BISTs to Tablicious, and to help you get an adaptation
of its Table code or similar into core Octave, to the extent that I have
time. You are of course also welcome to just take its code and use it in
a separate project.

Tablicious isn't an official GNU Octave project, and I don't have much
free time, so I wouldn't be an official GSoC mentor. But I'll help out
as time permits. Feel free to Cc me if you have questions about
Tablicious and I'll try to answer them (probably in the form of adding
documentation to the package).

For what it's worth, I think this is a good idea for a GSoC project.
Tables (or "dataframes") are an important part of modern Matlab, Python,
and R coding; it would be nice to see them readily available in Octave.

Please note that much of Tablicious's Table logic depends on a special
trick called "proxy keys" that I came up with for doing efficient
matching on multiple mixed-type columns (for use in operations such as
joins and membership tests). I think it's a good idea, but I don't know
if the core Octave maintainers agree; you might need to come up with
alternate matching logic if they're not a fan of it.

Sorry for not chiming in about this earlier when y'all were setting up
the GSoC stuff. The Wiki page says the project goal is to "define an
initial subset of table functions, which involve sorting, splitting,
merging, and file I/O". I'd suggest that rather than working on one or
two functions at a time, the project focus on choosing an overall
underlying data model or API for the Table data structure (that is,
deciding whether you want to use "proxy keys" or some other approach).
Almost all Table operations (besides I/O) are naturally going to be
built on top of that data model: they really boil down to variations of
mixed-type multi-column equivalence testing or sorting, and that is not
something existing Octave operations support, so you need to decide how
you're fundamentally going to deal with it.

I'd also suggest that whatever model you decide on be formally defined
in terms of M-code-level operations or functions, so that user-defined
classes and new Octave types can be readily supported by the Table array
type. For example, Tablicious's "proxy keys" model is defined in terms
of the unique(), sort(), and eq() functions on the types in table
columns, with a special exception for eq() on cell columns. (Yuck,
cellstrs.)
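
To make the idea concrete, here is a rough sketch of one way a unique()-based proxy key scheme could work. It is only an illustration under my own assumptions, not Tablicious's actual implementation; all variable names are invented. Each key column is replaced by integer indices into its unique values, so mixed-type rows become rows of a plain numeric matrix that existing functions can compare.

```octave
% Key columns of two hypothetical tables A and B: one numeric column
% and one cellstr column each.
a_id   = [1; 2; 2; 3];
a_name = {'x'; 'y'; 'z'; 'x'};
b_id   = [2; 3; 4];
b_name = {'z'; 'x'; 'x'};

% For each column, pool the values from both sides and replace every
% value by its index into the sorted unique values.
[~, ~, j_id]   = unique ([a_id; b_id]);
[~, ~, j_name] = unique ([a_name; b_name]);
proxy = [j_id(:), j_name(:)];

% Split the numeric proxy keys back into the two sides.
na   = numel (a_id);
pk_a = proxy(1:na, :);
pk_b = proxy(na+1:end, :);

% Multi-column, mixed-type membership test is now a plain numeric
% row comparison: which rows of A have a matching row in B, and where?
[tf, loc] = ismember (pk_a, pk_b, 'rows');
```

In this toy data only the last two rows of A, (2,'z') and (3,'x'), have partners in B, so tf comes out as [0; 0; 1; 1] and loc as [0; 0; 1; 2]. The same proxy matrix could be handed to sortrows() for multi-column ordering.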

If you want to get some theoretical background on tables, I would highly
recommend reading C.J. Date's book "Database In Depth" [1], which
describes the Relational Model that is the major theoretical basis for
table arrays and similar structures, in both SQL and in-memory
representations. (Or, if you're feeling more ambitious, try Date's other
book "An Introduction to Database Systems" [2], which is a
college-textbook style treatment of the same subject matter.)

Also, there's one reference missing from the Table section of the Octave
GSoC wiki page: the Octave Forge Dataframe package [3] is another
initial implementation of a table-like data structure. It does not
follow the Matlab table array API, but is conceptually and functionally
similar, and should probably be consulted for this project.

Cheers,
Andrew

[1] https://www.oreilly.com/library/view/database-in-depth/0596100124/
[2]
https://www.pearson.com/us/higher-education/program/Date-An-Introduction-to-Database-Systems-8th-Edition/PGM274345.html
[3] https://wiki.octave.org/Dataframe_package


Tablicious in Octave Forge [was: Re: [GSoC 2021] How should I do now with project Table datatype]

apjanke-floss
In reply to this post by siko1056

On 3/8/21 11:13 PM, Kai Torben Ohlhus wrote:
> On 3/8/21 6:51 PM, 陈栋林 wrote:
>> I have seen that you are the potential mentors in the project Table
>> datatype. How should I do now with this project for applying gsoc? How
>> can I make my first contribution? Thank you
>
> Thank you for your interest in GSoC with Octave.  Yes, I am willing to
> mentor a project on creating a Matlab compatible table datatype [1].

Since we're on the subject...

I know there hasn't been much enthusiasm for this in the past, but is
there any chance I could get y'all interested in including Tablicious
[1] in Octave Forge as an "External" package? Even if table arrays make
it in to core Octave from the GSoC work, I think Tablicious could be
useful as a transitional package or to support older versions of Octave
(which I think are still in kind-of wide use?). And it also provides
datetimes, categoricals, and (half-assed) string arrays.

Cheers,
Andrew

[1] https://github.com/apjanke/octave-tablicious


Re: Tablicious in Octave Forge [was: Re: [GSoC 2021] How should I do now with project Table datatype]

jbect
On 9 March 2021 at 09:58, Andrew Janke wrote:

> I know there hasn't been much enthusiasm for this in the past, but is
> there any chance I could get y'all interested in including Tablicious
> [1] in Octave Forge as an "External" package? Even if table arrays make
> it in to core Octave from the GSoC work, I think Tablicious could be
> useful as a transitional package or to support older versions of Octave
> (which I think are still in kind-of wide use?). And it also provides
> datetimes, categoricals, and (half-assed) string arrays.


+1


Re: Tablicious in Octave Forge [was: Re: [GSoC 2021] How should I do now with project Table datatype]

siko1056
In reply to this post by apjanke-floss
On 3/9/21 5:58 PM, Andrew Janke wrote:

>
> On 3/8/21 11:13 PM, Kai Torben Ohlhus wrote:
>> On 3/8/21 6:51 PM, 陈栋林 wrote:
>>> I have seen that you are the potential mentors in the project Table
>>> datatype. How should I do now with this project for applying gsoc?
>>> How can I make my first contribution? Thank you
>>
>> Thank you for your interest in GSoC with Octave.  Yes, I am willing to
>> mentor a project on creating a Matlab compatible table datatype [1].
>
> Since we're on the subject...
>
> I know there hasn't been much enthusiasm for this in the past, but is
> there any chance I could get y'all interested in including Tablicious
> [1] in Octave Forge as an "External" package? Even if table arrays make
> it in to core Octave from the GSoC work, I think Tablicious could be
> useful as a transitional package or to support older versions of Octave
> (which I think are still in kind-of wide use?). And it also provides
> datetimes, categoricals, and (half-assed) string arrays.
>
> Cheers,
> Andrew
>


Thank you for your offer to support the GSoC project =)

Your tablicious package is indeed very useful.  I can give you the
access rights needed to work on what is left of the abandoned Octave
Forge (OF) website, so that you can make the changes necessary to add
your package there.  Just tell me your SourceForge username.

Once you have figured out a working procedure for adding a new package,
please document it in the wiki [2], including a time estimate.

It is not that nobody wants to see your package on OF; rather, all the
admins with the time and knowledge to do it without breaking the
current design, etc. are no longer active.

Kai


[1] https://github.com/apjanke/octave-tablicious
[2] https://wiki.octave.org/Reviewing_Octave_Forge_packages#Admin_tasks
[3] https://gnu-octave.github.io/pkg-index/package/tablicious


Re: Tablicious in Octave Forge [was: Re: [GSoC 2021] How should I do now with project Table datatype]

apjanke-floss


On 3/9/21 11:20 PM, Kai Torben Ohlhus wrote:

> On 3/9/21 5:58 PM, Andrew Janke wrote:
>>
>> On 3/8/21 11:13 PM, Kai Torben Ohlhus wrote:
>>> On 3/8/21 6:51 PM, 陈栋林 wrote:
>>>> I have seen that you are the potential mentors in the project Table
>>>> datatype. How should I do now with this project for applying gsoc?
>>>> How can I make my first contribution? Thank you
>>>
>>> Thank you for your interest in GSoC with Octave.  Yes, I am willing
>>> to mentor a project on creating a Matlab compatible table datatype [1].
>>
>> Since we're on the subject...
>>
>> I know there hasn't been much enthusiasm for this in the past, but is
>> there any chance I could get y'all interested in including Tablicious
>> [1] in Octave Forge as an "External" package? Even if table arrays
>> make it in to core Octave from the GSoC work, I think Tablicious could
>> be useful as a transitional package or to support older versions of
>> Octave (which I think are still in kind-of wide use?). And it also
>> provides datetimes, categoricals, and (half-assed) string arrays.
>>
>> Cheers,
>> Andrew
>>
>
>
> Thank you for your offer to support the GSoC project =)
>
> Your tablicious package is indeed very useful.  I can offer you to
> obtain the necessary rights to access of what is left abandoned of the
> Octave Forge (OF) website and make any necessary changes to add your
> package there.  Just tell me your SourceForge username.

Cool. I'm "ajanke" on SourceForge.

> If you have figured out a working procedure how to add a new package,
> please document it in the wiki [2] including a time estimate.

I'll give it a go!

> It is not like nobody wants to see your package on OF, but all the
> admins having the time and knowledge how to do it without breaking the
> current design, etc. are no longer active.

Totally understandable.

Cheers,
Andrew


Re: Tablicious in Octave Forge [was: Re: [GSoC 2021] How should I do now with project Table datatype]

siko1056
On Thu, Mar 11, 2021 at 7:06 PM Andrew Janke <[hidden email]> wrote:

> On 3/9/21 11:20 PM, Kai Torben Ohlhus wrote:
> >
> > Your tablicious package is indeed very useful.  I can offer you to
> > obtain the necessary rights to access of what is left abandoned of the
> > Octave Forge (OF) website and make any necessary changes to add your
> > package there.  Just tell me your SourceForge username.
>
> Cool. I'm "ajanke" on SourceForge.
>
> > If you have figured out a working procedure how to add a new package,
> > please document it in the wiki [2] including a time estimate.
>
> I'll give it a go!
>
> > It is not like nobody wants to see your package on OF, but all the
> > admins having the time and knowledge how to do it without breaking the
> > current design, etc. are no longer active.
>
> Totally understandable.
>
> Cheers,
> Andrew

You have got the power now ;-)

Kai

Re: Tablicious in Octave Forge [was: Re: [GSoC 2021] How should I do now with project Table datatype]

apjanke-floss


On 3/11/21 8:27 AM, Kai Torben Ohlhus wrote:

> On Thu, Mar 11, 2021 at 7:06 PM Andrew Janke <[hidden email]> wrote:
>
>
>     On 3/9/21 11:20 PM, Kai Torben Ohlhus wrote:
>     >
>     > Your tablicious package is indeed very useful.  I can offer you to
>     > obtain the necessary rights to access of what is left abandoned of the
>     > Octave Forge (OF) website and make any necessary changes to add your
>     > package there.  Just tell me your SourceForge username.
>
>     Cool. I'm "ajanke" on SourceForge.
>
>     > If you have figured out a working procedure how to add a new package,
>     > please document it in the wiki [2] including a time estimate.
>
>     I'll give it a go!
>
>     > It is not like nobody wants to see your package on OF, but all the
>     > admins having the time and knowledge how to do it without breaking the
>     > current design, etc. are no longer active.
>
>     Totally understandable.
>
>     Cheers,
>     Andrew
>
>
> You have got the power now ;-)
>
> Kai

[hacker voice] I'm in!

Good chance I'll have time to look in to this over the weekend.

Cheers,
Andrew


Re: Tablicious in Octave Forge [was: Re: [GSoC 2021] How should I do now with project Table datatype]

John W. Eaton
Administrator
In reply to this post by apjanke-floss
On 3/9/21 3:58 AM, Andrew Janke wrote:

> Even if table arrays make
> it in to core Octave from the GSoC work, I think Tablicious could be
> useful as a transitional package or to support older versions of Octave
> (which I think are still in kind-of wide use?). And it also provides
> datetimes, categoricals, and (half-assed) string arrays.

If you want to create a forge package that would be great, but I'd also
be glad to include these classes in core Octave and can offer some help
with that if you are interested.

jwe



Re: Tablicious in Octave Forge [was: Re: [GSoC 2021] How should I do now with project Table datatype]

apjanke-floss
On 3/11/21 2:59 PM, John W. Eaton wrote:

> On 3/9/21 3:58 AM, Andrew Janke wrote:
>
>> Even if table arrays make it in to core Octave from the GSoC work, I
>> think Tablicious could be useful as a transitional package or to
>> support older versions of Octave (which I think are still in kind-of
>> wide use?). And it also provides datetimes, categoricals, and
>> (half-assed) string arrays.
>
> If you want to create a forge package that would be great, but I'd also
> be glad to include these classes in core Octave and can offer some help
> with that if you are interested.
>
> jwe

Oh, good! I am definitely interested. My goal for Tablicious has always
been to build this into something which I can contribute up to core
Octave, because:
  a) IMHO this stuff is quite useful for modern data-sciencey stuff and
maybe all Octave users should get it,
  b) Some of the semantics of Tablicious really need to be implemented
as core Octave language features, not just a library, and
  c) This is all stuff that is in base Matlab in recent versions, so for
feature parity and compatibility, it should end up in core Octave too.

I don't think that Tablicious is something that is ready for absorption
in to core Octave yet, because the code quality is not good enough yet
(seriously), and there are some fundamental semantic issues that need to
be worked out. In particular, I think we might need to sort out:
  * Whether and how core Octave wants to implement `missing` semantics
for its datatypes.
  * Whether my "proxy key" approach for tables is a good one, and one
that core Octave wants to adopt, and whether this should be done all in
terms of M-code functions or if there should be built-in support added
for it.
  * How Octave as a whole wants to deal with time zone definitions, the
Olson database, updates to it, interaction with OS-provided time zone
definitions, and so on.
  * Aaaaaand string arrays and the long-term plan for double-quoted
string literals in Octave. (Because handling cells in table columns is
really a pain.)

This probably means that I should actually write some decent
documentation about Tablicious's table join semantics and the "proxy
key" technique.

Also, time zone support in Tablicious is currently definitely broken,
and not good enough for inclusion in core Octave yet.

My intention for having Tablicious included in Octave Forge is just a
way of getting the word out about Tablicious to new users, since I think
Octave Forge is probably the first place most Octave users go to find
out what extension packages are available for it? Whether Tablicious
makes it in to Forge or not, my long term hope is for all this to end up
in core Octave, and the Tablicious package is just a proof-of-concept
and support-for-old-Octave versions thing.

There's one hitch: since starting the Tablicious project, I've been
exposed to MathWorks' intellectual property around the datetime class,
so I can probably no longer do any work on the datetime and maybe time
zone stuff in Tablicious, and that probably needs to be handed over to
somebody else. But that's fine: the semantics of datetime are
well-established, and don't interact in complex ways with table array
semantics; it can be cleanly separated.

In the short term, I'm going to try to figure out this new Forge package
thing, because I'd really like the extra eyeballs on my package, and I
think integration into core Octave is a ways off. Maybe the upcoming
GSoC will help.

Cheers,
Andrew


Re: Tablicious in Octave Forge [was: Re: [GSoC 2021] How should I do now with project Table datatype]

Guillaume
I think having a GSoC student working on the implementation of table
would be great, and I would skip the Forge packaging to focus on adding
the functionality to core Octave (of course, reusing as much of
Tablicious as possible).

Once the low-level data structure is in place (a write-up on "proxy key"
would be great), the burden of implementing all table-related functions
could be shared among all interested parties:
https://www.mathworks.com/help/matlab/tables.html

Would it also be possible to decouple projects? Why not implement (part
of) table without requiring strings or datetime or categorical or
missing? I feel that if the foundations are there and robust, we can
build the rest one bit at a time.

Guillaume.




Re: Tablicious in Octave Forge [was: Re: [GSoC 2021] How should I do now with project Table datatype]

apjanke-floss

On 3/12/21 9:25 AM, Guillaume wrote:
> I think having a GSoC student working on the implementation of table
> would be great and I would skip the Forge packaging to focus on adding
> the functionality in core Octave (of course, reusing as much as possible
> of Tablicious).

Okay. I'll think about it. To be honest, I think implementing table
might be kind of a large project, so it might be a while before it lands
in an Octave release, and I'd still like to get more user eyeballs on
Tablicious in the meantime.

(I don't expect anyone else to put time in to getting Tablicious on
Forge; that'd be just me.)

> Once the low level data structure is in place (a write-up on "proxy key"
> would be great),

I'll probably make documenting "proxy key" and related stuff my top
priority here, then.

> the burden of the implementation of all table related
> functions could be shared among all interested parties:
> https://www.mathworks.com/help/matlab/tables.html
> Would it also be possible to decouple projects? Why not implementing
> (part of) table without requiring strings or datetime or categorical or
> missing? I feel that if the foundations are there and robust, we can
> build the rest one bit at a time.

Yes, mostly: I think table needs missing/ismissing(), because that's a
fundamental part of the join semantics. Defining and implementing
missing/ismissing() is pretty easy.
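
As a rough illustration of why missing-ness matters for joins, here is a small sketch. It assumes SQL-like semantics in which a missing key never matches anything, and it uses two made-up helper handles (is_missing_num, is_missing_str) rather than any existing ismissing() implementation. NaN marks a missing number and the empty string marks a missing cellstr element, as in Matlab.

```octave
% Hypothetical per-type "is missing" tests (illustrative names only).
is_missing_num = @(x) isnan (x);
is_missing_str = @(c) cellfun (@isempty, c);

% Numeric keys: NaN == NaN is already false, so missing values never
% match under plain eq().
a = [1; NaN; 3];
b = [1; NaN; 4];
match_num = (a == b) & ~is_missing_num (a) & ~is_missing_num (b);

% Cellstr keys: strcmp ('', '') is true, so missing values must be
% masked out explicitly or two missing keys would join to each other.
an = {'x'; ''; 'z'};
bn = {'x'; ''; 'q'};
match_str = strcmp (an, bn) & ~is_missing_str (an) & ~is_missing_str (bn);
```

Both match vectors come out as [1; 0; 0]: only the first pair of keys is present on both sides and equal.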

The other types - datetime, string, categorical - are mostly separable. I
just threw them in to Tablicious because I think they are rather useful
types to stick in to columns in tables, and I wanted users to be able to
install just the one package instead of 3 or 4.

Except for how those types get displayed when a table is pretty-printed
or exported to Excel or CSV: there's no existing interface (in Octave or
in Matlab) for defining custom displays or conversions for types or
classes on a per-element basis when those types are placed inside a
composite type (as opposed to a custom display format for the overall
array, like disp() does).

That is, for datetimes in a table:

dt = datetime({'1/1/2001', '2/1/2001', '3/1/2001'}');
t = table(dt);
prettyprint(t)

I really want it to do this:

        dt
    ___________
    01-Jan-2001
    01-Feb-2001
    01-Mar-2001

and not this:

           dt
    _________________
    [1-by-1 datetime]
    [1-by-1 datetime]
    [1-by-1 datetime]

If the interface for defining that sort of display could be defined,
then string/categorical/datetime are totally decouple-able. (Or you
could just give up on a generic display customization interface and
write up special-case handling for each of string, categorical, and
datetime when they do appear in Octave. That might be the
maximally-Matlab-compatible approach anyway.)
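
A sketch of what such a per-element display hook might look like follows. Everything in it is an assumption for illustration: the helper name element_strs_sketch is invented, and the convention that a class opts in by defining a dispstrs() method is hypothetical, not an existing Octave or Matlab interface.

```octave
1;  % mark this file as a script so the function below can be defined inline

% Hypothetical helper: return one display string per element of a
% table column, as an n-by-1 cellstr.
function out = element_strs_sketch (col)
  if (iscellstr (col))
    out = col(:);
  elseif (isnumeric (col) || islogical (col))
    out = arrayfun (@num2str, col(:), 'UniformOutput', false);
  elseif (isobject (col) && ismethod (col, 'dispstrs'))
    % Assumed convention: a class opts in by defining dispstrs(),
    % which returns one string per element.
    out = dispstrs (col);
    out = out(:);
  else
    % Generic fallback, the unhelpful display shown above.
    out = repmat ({sprintf('[1-by-1 %s]', class (col))}, numel (col), 1);
  endif
endfunction

strs_num   = element_strs_sketch ([1; 2.5]);
strs_str   = element_strs_sketch ({'a'; 'bc'});
strs_other = element_strs_sketch (struct ('v', {1, 2}));
```

A table pretty-printer or CSV exporter could call a hook like this once per column; a datetime class would then only need to supply its own dispstrs() to get readable output inside a table.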

Cheers,
Andrew


Re: Tablicious in Octave Forge [was: Re: [GSoC 2021] How should I do now with project Table datatype]

apjanke-floss

>> Once the low level data structure is in place (a write-up on "proxy key"
>> would be great),
>
> I'll probably make documenting "proxy key" and related stuff my top
> priority here, then.

I've added some developer documentation for Tablicious:

http://tablicious.janklab.net/

Including stuff on the "proxy key" trick:

http://tablicious.janklab.net/Join-Behavior.html

If anything in the doco isn't clear, or you'd like to see more stuff in
there, drop a request on the repo Issues tracker at:

https://github.com/apjanke/octave-tablicious/issues

Cheers,
Andrew