Looking forward to your reply.
To my assigned mentors, Douglas Stewart and Nicholas Jankowski: should we start the discussions about my project, "A fast and accurate command line suggestion feature"? What time and mode of communication would the two of you prefer? I am comfortable with both the mailing list and the IRC channel. Other modes of communication, such as personal email or chat messengers such as WhatsApp and WeChat, are fine as well. As for the time of communication, I would be comfortable with any time during the summer (i.e. after 21st May). For now, however, college is still in session, so if synchronous communication is required, I would prefer a time between 7:30 am UTC and 8:30 pm UTC. Asynchronous communication is possible at any time, though.
All, I am thankful to all the maintainers and mentors who believed that I am capable enough to contribute to GNU Octave for GSoC 2018. Thank you for accepting my proposal. I promise to meet the expected standards.
On Tue, Apr 24, 2018 at 9:52 AM, Sudeepam Pandey <[hidden email]> wrote:
Here are some ideas for you.
Here is an example of the start of a timeline, etc.
Thank you Doug. I have gone through these links. I will inform both of you after I finish the following tasks...
1) Initialize a public repository on Bitbucket for my project.
2) Set up a blog to report the weekly work, write an initial post, and link it to the Bitbucket repository.
It would be really helpful for me if both of you could share your preferred mode and time of communication with me.
Additionally, I would like to inform both of you, and the Octave community in general, that Shane from octave-online has previously shared with me, via email, a little more than 100,000 user sessions and a list of 1000 common misspellings.
I do have some technical doubts, some proposals, and some points to discuss. Should we proceed with them here or take them back to bug 46681?
On Wed, Apr 25, 2018 at 5:49 PM, Sudeepam Pandey <[hidden email]> wrote:
I think that we should talk here.
Probably best. I'm UTC-4, so most days of the week async communication will be most practical, and email is as good as anything for that.
On Wed, Apr 25, 2018 at 4:55 PM, Doug Stewart <[hidden email]> wrote:
I'd like to add that it's especially important to include in your timeline clear goals for the two midterm evaluations and final evaluation, since the evaluations the mentors will submit to Google will be based on how well you met your goals. Students need to talk with their mentors during the bonding period to make sure you agree on what the goals should be.
Thank you Nir Krakauer for your inputs. The following is an abstract from my GSoC proposal.
Q1) I'd like to ask whether these goals are acceptable to the mentors.
Phase 1 evaluations goal: A set of working Neural Network m-scripts, which, together,
could suggest corrections for typographic errors.
Phase 2 evaluations goal: A development version of Octave which has a command line
suggestion feature (currently there will be no mechanism available to easily select the
corrections suggested and easily enable/disable this feature).
Phase 3 evaluations goal: A development version of Octave with a complete and working
command line suggestion feature, open to feedback and criticisms.
Q2) I would also like to ask what happens if I am unable to meet, say, the phase 1 evaluation goal at the desired time, but I end up completing both the phase 1 and phase 2 goals before the phase 2 evaluation deadline.
On Thu, Apr 26, 2018 at 10:50 PM, Nir Krakauer <[hidden email]> wrote:
So I have set up my blog at  and a public repository at . Other than that, can anyone direct me to a link where all the existing functions of GNU Octave can be found in the form of a list? I know about this link, but over there a description of each function is included with the function name. I essentially require only the function names, so if I copy anything from this page, I'll have to clean it up first. If a plain list exists then it's good; otherwise I'll just proceed with the comprehensive list on that page.
My public repo at Bitbucket contains a branch by the name of "Did_you_mean" where I plan to push all the small changes that I make to my project. Then a large change-set can be pushed to the default branch.
a) Please take a look at these and inform me about anything that you'd like changed.
b) Kindly also inform me about anything that needs to be done after setting up the repo and the blog.
On Thu, Apr 26, 2018 at 5:27 AM, Nicholas Jankowski <[hidden email]> wrote:
On Thu, Apr 26, 2018 at 3:12 PM, Sudeepam Pandey <[hidden email]> wrote:
First: You should be bottom posting or inline posting.
Second: On your timeline you should show about 3-5 steps on how you are going to meet these goals.
On Thu, Apr 26, 2018 at 3:26 PM, Sudeepam Pandey <[hidden email]> wrote:
Getting a list of all Octave commands is one of the steps towards a goal.
Getting a list of all package commands is another step towards your goals, etc.
On Fri, Apr 27, 2018 at 1:09 AM, Doug Stewart <[hidden email]> wrote:
Thank you for pointing that out, I will follow this from now on.
Please take a look at this link: https://docs.google.com/document/d/1mKLA0Yi-kz7sLY591FU-GZA_Pdhz9j4_U-8YMD_5gSw/edit?usp=sharing
It contains the timeline from the final proposal that I had sent in. I have described the required steps in it. Is it alright or should I add something more?
On Fri, Apr 27, 2018 at 1:12 AM, Doug Stewart <[hidden email]> wrote:
Would you like me to make any changes to the bitbucket repository or my blog?
On 04/26/2018 03:12 PM, Sudeepam Pandey wrote:
> Thank you Nir Krakauer for your inputs. The following is an abstract
> from my GSoC proposal.
> Phase 1 evaluations goal: A set of working Neural Network m-scripts,
> which, together,
> could suggest corrections for typographic errors.
Can you explain how neural networks figure in this task? I've noticed
that recent versions of GCC provide a suggestion feature when
identifiers are not recognized. You might look at how it works. I
think most of the search and matching work is done in the spellcheck.c
and spellcheck-tree.c files:
Does it use neural networks to do that job?
On Fri, Apr 27, 2018 at 2:21 AM, John W. Eaton <[hidden email]> wrote:
On 04/26/2018 03:12 PM, Sudeepam Pandey wrote:
This  is a demo implementation that I made earlier to demonstrate my idea. I have explained my idea in the README.md file of this repository. The algorithm that GCC uses is the edit distance algorithm, which would in fact be equally or slightly more accurate than neural networks for the suggestion task but would probably be slower. My reason for using neural networks is that they should provide 'just the right amount' of accuracy and should be relatively faster.
On Thu, Apr 26, 2018 at 4:47 PM, Sudeepam Pandey <[hidden email]> wrote:
I don't see any ideas as to how to do packages! How are you going to handle the situation, that the user has a package loaded and is typing in a command word from the package?
On Fri, Apr 27, 2018 at 2:41 AM, Doug Stewart <[hidden email]> wrote:
I understand. One thing I can do is train a single network to understand the spellings of all the functions (Core + Forge) of Octave. Whenever a user types in something wrong, the network will then suggest corrections not only from the core but also from the Forge packages. The user would then select one of the suggested commands, and if the package is missing or not loaded, the appropriate error (missing package/package not loaded) will be shown to the user just as it is shown currently. However, I understand that it may be a waste of computational resources to consider packages that have not been installed or loaded. I'll try to think of something better, inform you via email, and update the timeline when I do.
On Thu, Apr 26, 2018 at 5:11 PM, Doug Stewart <[hidden email]> wrote:
That was going to be my second question. Is there a good way to get a list of all loaded functions? All installed functions? Personally, I'm not sure.
My first question relates to what John mentioned, and something I think I suggested somewhere else before: there are other suggestion features out there, including FOSS implementations. I like the novelty of the neural net approach for building up, say, a pretrained suggestion lookup ("autocorrect") table, and a live algorithm that will suggest for something not in the predefined table. But surely others have taken different approaches to spellcheck, autocorrect, suggestions, etc., from decision trees and error estimators to inline and GUI interface approaches. What else is out there already?
I think a very important part of this is not just to start implementing your code, but to review a bit of what's out there, decide what's good & bad, and what's worth borrowing when code compatibility exists.
As a person who researches in the area of machine learning I see many issues with using this for Octave. An immediate thought is “are we trying to adapt a problem to the solution?” Neural networks have issues like fixed input/output size (which you already ran into), model capacity, bias (as in mathematical bias), and catastrophic forgetting (with continual learning).
Doug brings up a great point: the only way a NN would be feasible with dynamic loading of packages would be to either use continual learning (which might not work with one-hot encodings?) or retrain the model on package loading. But at some point we are also going to have to increase model capacity in order to accommodate new function names as in the case. Also, you appear to be encoding your inputs as ASCII, which is probably less than ideal given that it assumes that inputs in your feature space are closer if their ASCII distance is smaller. In natural language processing we usually use word embeddings, but since the names of functions aren’t really natural language, that probably wouldn’t apply. You could use an LSTM or convolutions, which may assist you with your input length problem, but those are just going to slow you down and make it less feasible since you’re doing this all in Octave.
A smart implementation of edit distance would likely serve Octave better. Using assumptions about the errors that users commonly make would allow you to shrink your search space. Reasonable implementations of edit distance are Θ(mn), with m and n being the lengths of the two strings being compared, so we only have to multiply that by the number of functions, or perhaps names, in Octave. With some clever tricks you can reduce the number of functions you have to check in this scheme: for instance, assuming that most people know the first character of the function they want might be sufficient, or you could only look at words that are similar in length. These are things you’d want to research first anyway; I’m sure it is a solved problem.
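To make the pruning idea above concrete, here is a minimal sketch in Python (not Octave, and not taken from GCC or from the proposal). The function names `edit_distance` and `suggest`, the threshold `max_distance=2`, and the sample name list are all assumptions chosen for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance using a two-row dynamic programme.

    Runs in time proportional to len(a) * len(b).
    """
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete ch_a
                               current[j - 1] + 1,       # insert ch_b
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def suggest(typed, names, max_distance=2, same_first_char=True):
    """Return known names within max_distance of typed, closest first.

    Two cheap filters (assumptions, not proven heuristics) skip most
    names before the quadratic distance computation is run.
    """
    candidates = []
    for name in names:
        # A length gap larger than the threshold already exceeds it.
        if abs(len(name) - len(typed)) > max_distance:
            continue
        # Optional assumption: the first character was typed correctly.
        if same_first_char and name[:1] != typed[:1]:
            continue
        distance = edit_distance(typed, name)
        if distance <= max_distance:
            candidates.append((distance, name))
    return [name for _, name in sorted(candidates)]


# Hypothetical name list; a real one would come from Octave itself.
print(suggest("lenght", ["length", "linspace", "plot"]))
```

The first-character filter is the assumption most likely to fail in practice, so it is left switchable; with `same_first_char=False` the length filter alone still removes many candidates.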
As the model capacity increases we also have to look at the complexity of the model as far as computation as well. Retraining the network with more examples will take longer and as we discovered I don’t think we can use a pretrained network.
So in summary the problems are:
- Fixed input/output size
- Training time
- Dynamic names
- Simpler solutions exist
- Model encoding is poor
I think the most common issue is going to be dynamic names. Given that we are using training examples from an external source, they would have to be packaged somehow with the distro.
Some mixed method might be interesting but again I think a well designed edit-distance would serve better.
On Fri, Apr 27, 2018 at 3:23 AM, Bradley Kennedy <[hidden email]> wrote:
Edit distance is indeed an excellent algorithm for this job, but I disregarded it only because a 'smart' implementation would require some really important assumptions, for instance that the user will always type the first character correctly.
I have also gone through the problems that you have stated about neural networks. Most of these can also be solved with some reasonable assumptions, and I have concluded that the decision about whether to use neural nets or edit distance ultimately rests on which assumptions we consider more practical.
Examples of solutions to the problems you stated...
For the input size problem, it is reasonable to assume that the user would, at worst, misspell a word with no more than 3 extra characters, so we can keep the input layer size = 3 + the length of the longest function name in Octave.
Due to the ASCII encoding, something like this will happen: say we have 2 proper words, 'amuse' and 'abuse', and someone types in 'atuse'. In such a case, the network would by default output 'amuse' because the ASCII code of 't' is closer to that of 'm' than to that of 'b'. I did not consider this to be a problem because the network really is giving us 'the closest match'. However, even if we do consider it a problem, considering all the classes within a probability range, instead of only the class with the highest probability, would solve it.
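The 'atuse' example can be checked with a small Python sketch. It does not train or run a network; it only compares zero-padded ASCII input vectors, which is a rough stand-in for the geometry the network sees. The helper names (`encode`, `ascii_distance`, `classes_within_range`), the two-word vocabulary, and the probabilities passed to the last helper are all hypothetical.

```python
def encode(word, input_size):
    """Encode a word as ASCII codes, zero-padded to a fixed input size."""
    codes = [ord(ch) for ch in word[:input_size]]
    return codes + [0] * (input_size - len(codes))


def ascii_distance(a, b, input_size):
    """Sum of absolute differences between two encoded words."""
    return sum(abs(x - y)
               for x, y in zip(encode(a, input_size), encode(b, input_size)))


def classes_within_range(probabilities, margin):
    """Keep every class whose probability is within margin of the best one."""
    best = max(probabilities.values())
    return sorted(name for name, p in probabilities.items()
                  if best - p <= margin)


known = ["amuse", "abuse"]  # stand-in vocabulary
# Input size = 3 extra characters + length of the longest known word.
input_size = 3 + max(len(word) for word in known)

typed = "atuse"
distances = {word: ascii_distance(typed, word, input_size) for word in known}
print(distances)                          # 't' is 7 from 'm' and 18 from 'b'
print(min(distances, key=distances.get))  # the ASCII-closest word

# Made-up output probabilities, only to show the range-based selection.
print(classes_within_range({"amuse": 0.55, "abuse": 0.40, "plot": 0.05}, 0.2))
```

With these inputs the ASCII-closest word is 'amuse', matching the description above, and the range-based selection returns both 'abuse' and 'amuse'.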
Just as we would need to add new function names to the edit distance check list with each new Octave release, we would need to retrain the neural network to incorporate new functions with each release. This is doable since we will be the ones making these changes at every release; the user won't be affected by it. Training time is a problem, but it won't be so large as to delay a release, so it's doable, and using GPU services is also an option. The increase in the complexity of the model with the increase in capacity is real, but the edit distance algorithm will face a similar increase in cost because it will have more words to look at.
The only real problem, in my view, with neural networks is the dynamic package loading. That, as I said, 'could be solved' by using a large neural network that incorporates all the existing functions of Octave (Core + Forge). I accept the fact that it may not be the most optimal solution though.
I think it would be best if we worked out a solution to the problem with both approaches. We can then compare them and jointly decide which method rests on more practical assumptions and which one is better optimized.
On Thu, Apr 26, 2018 at 7:12 PM, Sudeepam Pandey <[hidden email]> wrote:
I would suggest QWERTY keyboard distance if we're trying to catch the majority of fumble-finger typos. Then again, that would exclude people like me typing on Dvorak...
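As a rough illustration of what a keyboard distance could look like, here is a Python sketch that places the letter keys of a US QWERTY layout on a grid and measures the straight-line distance between two keys. The row offsets, the fallback value `UNKNOWN_KEY_DISTANCE`, and the function name `key_distance` are assumptions; a Dvorak user would need a different `ROWS` table.

```python
import math

# Letter rows of a US QWERTY layout.
ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
# Assumed horizontal stagger of each row, in key widths.
ROW_OFFSETS = [0.0, 0.25, 0.75]
# Assumed distance for characters that are not on the letter rows.
UNKNOWN_KEY_DISTANCE = 5.0

KEY_POSITIONS = {
    ch: (row, col + ROW_OFFSETS[row])
    for row, keys in enumerate(ROWS)
    for col, ch in enumerate(keys)
}


def key_distance(a, b):
    """Euclidean distance between two keys, in key widths."""
    a, b = a.lower(), b.lower()
    if a == b:
        return 0.0
    if a not in KEY_POSITIONS or b not in KEY_POSITIONS:
        return UNKNOWN_KEY_DISTANCE
    (row_a, col_a), (row_b, col_b) = KEY_POSITIONS[a], KEY_POSITIONS[b]
    return math.hypot(row_a - row_b, col_a - col_b)


# For the earlier 'atuse' example: which intended key is physically nearer?
print(key_distance("t", "m"), key_distance("t", "b"))
```

Under these assumed key positions 't' is nearer to 'b' than to 'm', the opposite of the ASCII ordering discussed earlier in the thread, so the choice of distance changes which suggestion ranks first.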