Detect language and correct accordingly #15

Open
rymai opened this issue Oct 1, 2015 · 11 comments

Comments

@rymai

rymai commented Oct 1, 2015

I received a pull request on rymai/elevator-simulator#1 where "attendent" was treated as a misspelling of "attendant". The problem is that the README is in French, and in French "ils attendent" means "they are waiting".

@uiteoi

uiteoi commented Oct 1, 2015

Your README mixes both English and French, which will make it extra hard to detect the language.

You are probably not alone in mixing languages, as this can happen for a number of reasons.

orthographic-pedant could use a number of methods to fix this one:

  • blacklisting
  • language settings in orthographic-pedant configuration files
  • language settings embedded in md files using language tags
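The blacklisting option could look roughly like the sketch below. This is only an illustration: the names `BLACKLISTED_REPOS`, `BLACKLISTED_WORDS` and `filter_corrections` are hypothetical and not taken from orthographic-pedant, and the correction list is assumed to be a plain mapping from suspected misspelling to proposed fix.

```python
# Hypothetical names throughout; none of this is taken from orthographic-pedant.
BLACKLISTED_REPOS = {"rymai/elevator-simulator"}  # repos the bot should never touch
BLACKLISTED_WORDS = {"attendent"}                 # words that are valid in another language


def filter_corrections(repo, corrections):
    """Drop corrections that target a blacklisted repo or a blacklisted word.

    `corrections` maps a suspected misspelling to its proposed fix.
    """
    if repo.lower() in BLACKLISTED_REPOS:
        return {}
    return {
        bad: good
        for bad, good in corrections.items()
        if bad.lower() not in BLACKLISTED_WORDS
    }
```

A repo-level blacklist stops every suggestion for that repo, while a word-level blacklist only suppresses the ambiguous word everywhere.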

@rymai
Author

rymai commented Oct 1, 2015

Thanks @uiteoi. Indeed the readme mixes both languages, but at least the script could stop its work in that case instead of proposing a wrong correction...

I certainly don't want to add a configuration file for this bot, nor add settings in md files. :)

@uiteoi

uiteoi commented Oct 1, 2015

Proper detection is certainly the best way forward. Given the complexity of implementing it, I was considering other options. In your case, blacklisting would be the most appropriate short-term solution.

@rymai
Author

rymai commented Oct 1, 2015

Exactly!

@thoppe
Owner

thoppe commented Oct 1, 2015

What I've found is that explicit white-listing is the way to go. I made a few early mistakes correcting "Ceasar" to "Caesar" and had half of Latin America mad at me. For this particular case I'm going to remove this word from the correction list. I've only done the A's so far; you can see what corrections will be attempted here:

https://github.com/thoppe/orthographic-pedant/blob/master/wordlists/parsed_wikipedia_list.txt

A poor man's check for a possible foreign language would be to test whether the entire README can be converted to ASCII without loss. Obviously this is a bit heavy-handed, but I'm not sure how this problem is solved in the real world.
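A minimal sketch of that ASCII check, assuming the README has already been read into a Python string (the function name `is_lossless_ascii` is made up for illustration):

```python
def is_lossless_ascii(text):
    """Return True if every character in `text` is plain ASCII."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        # At least one character (e.g. an accented letter) has no ASCII form.
        return False
    return True
```

Note that a phrase like "ils attendent" is itself pure ASCII, so this check only helps when the rest of the file contains accented or other non-ASCII characters.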

thoppe closed this as completed in 3f53216 Oct 1, 2015

@uiteoi

uiteoi commented Oct 1, 2015

@thoppe, you are going to have this same problem with countless other words. French and English in particular share countless words with slightly different spellings, e.g. example / exemple, apartment / appartement, ...

So I would suggest that you start looking into some form of detection and make it easy to blacklist repos.

Good luck with your project.

@uiteoi

uiteoi commented Oct 1, 2015

Another possible suggestion: if a repo owner has rejected a pull request once, you may want to blacklist that repo automatically to avoid submitting further suggested fixes.
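As a rough sketch of that idea, assuming the bot keeps a record of its own pull requests as plain dictionaries (the keys `repo`, `state` and `merged` and the function name `update_blacklist` are hypothetical, chosen only for this example):

```python
def update_blacklist(blacklist, pull_requests):
    """Add every repo whose pull request was closed without being merged.

    `pull_requests` is an iterable of dicts with the hypothetical keys
    "repo" (str), "state" ("open" or "closed") and "merged" (bool).
    """
    for pr in pull_requests:
        if pr["state"] == "closed" and not pr["merged"]:
            blacklist.add(pr["repo"])
    return blacklist
```

Open and merged pull requests leave the blacklist untouched; only an explicit rejection adds the repo.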

@thoppe
Owner

thoppe commented Oct 1, 2015

Good suggestions @uiteoi. Since I don't speak French, is there a list of "homophonic cognates" somewhere that you can vouch for as a good starting point?

Natural language is deceptively hard to get right, especially when I have to cross the phase boundary between two of them!

As a side note, many happy users reject a PR by accident since they are unfamiliar with GitHub's PR system. By my ad hoc estimate, this amounts to about 5%. Very few people vehemently dislike the bot (but that number is not zero).

@uiteoi

uiteoi commented Oct 1, 2015

Here's a Wikipedia page listing common spelling mistakes in French; it is used by the WPCleaner bot to detect spelling mistakes.

https://fr.wikipedia.org/wiki/Wikip%C3%A9dia:Liste_de_fautes_d%27orthographe_courantes

You should expect most major languages to have similar lists for the WPCleaner bot to use.

I see you are using Python, which has a number of natural-language libraries, such as NLTK. Here's an example I found by googling "python natural language detection":
https://pypi.python.org/pypi/guess-language

People who reject a PR by accident should be able to submit a PR on your repo to get removed from the blacklist.

I personally think this is a great project and I encourage you to further develop it.

@thoppe
Owner

thoppe commented Oct 9, 2015

I'm going to reopen this issue since it turns out this is a really good idea. It shouldn't be too hard to detect if the language is not English and skip the repo outright. This should help with the words that are correct in French and English at least.
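One very rough way to sketch the "skip if not English" step, without relying on any detection library, is to count common stopwords. This is a stand-in technique for illustration, not what the bot actually does; the word sets are small hand-picked samples and the names `ENGLISH_STOPWORDS`, `FRENCH_STOPWORDS` and `looks_english` are made up here.

```python
import re

# Small, hand-picked samples; illustrative only, far from exhaustive.
ENGLISH_STOPWORDS = {"the", "and", "is", "of", "to", "in",
                     "that", "it", "with", "for", "this", "are"}
FRENCH_STOPWORDS = {"le", "la", "les", "et", "est", "des",
                    "un", "une", "dans", "pour", "que", "ils"}


def looks_english(text, threshold=0.5):
    """Guess whether `text` is mostly English by comparing stopword counts."""
    # Runs of letters only (Unicode-aware), lowercased.
    words = re.findall(r"[^\W\d_]+", text.lower())
    english_hits = sum(word in ENGLISH_STOPWORDS for word in words)
    french_hits = sum(word in FRENCH_STOPWORDS for word in words)
    total = english_hits + french_hits
    if total == 0:
        # No evidence either way: report "not English" so the repo is skipped.
        return False
    return english_hits / total > threshold
```

A mixed-language README lands somewhere between the two extremes, so the `threshold` decides how much foreign text is tolerated before the repo is skipped.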

thoppe reopened this Oct 9, 2015

@uiteoi

uiteoi commented Oct 10, 2015

Great 👍
