Settings definition different between original langdetect and bundle #15

Open
marbleman opened this issue Mar 8, 2016 · 11 comments

@marbleman

Had a hard time today figuring out why my application slowed down by a factor of around 20. After a lot of profiling I found langdetect to be the cause. I finally compared the original langdetect plugin with the plugin-bundle and wrote a unit test to measure execution time.

The reason is quite simple: the original langdetect plugin expects the setting as
langdetect.languages = en,de,fr
while the plugin-bundle wants to see
languages = en,de,fr
in elasticsearch.yml

This applies to all settings (compare src\main\java\org\xbib\elasticsearch\module\langdetect\LangdetectService.java for details)
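
To illustrate why a mismatched key can cause such a slowdown without any error, here is a minimal Python sketch (not the plugin's actual Java code). It assumes the service reads its language list from a flat settings map and silently falls back to a full default list when the key is missing. `ALL_LANGUAGES`, `load_languages`, and the fallback behaviour are hypothetical, chosen only for illustration.

```python
# Hypothetical sketch: a settings lookup that silently falls back to defaults.
# The default list is a stand-in, not the plugin's real language list.
ALL_LANGUAGES = ["ar", "de", "en", "es", "fr", "it", "ja", "nl", "ru", "zh"]


def load_languages(settings, key):
    """Return the configured languages, or every language if `key` is absent."""
    raw = settings.get(key)
    if raw is None:
        # No error is raised, so the misconfiguration goes unnoticed.
        return list(ALL_LANGUAGES)
    return [code.strip() for code in raw.split(",") if code.strip()]


# Settings as written for the original plugin (prefixed key).
settings = {"langdetect.languages": "en,de,fr"}

# A reader that expects the prefixed key sees the restriction ...
prefixed = load_languages(settings, "langdetect.languages")
# ... while a reader that expects the bare key falls back to all languages.
bare = load_languages(settings, "languages")

print(prefixed)
print(len(bare), "languages loaded instead of", len(prefixed))
```

Under these assumptions the detector ends up comparing against every language profile rather than the three configured ones, which would fit the slowdown described above.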

Is this intended? If yes, I will push an update to the docs...

BTW: I also tried the parameter ?profile=/langdetect/short-text/ since it seemed it could speed up detection (probably at the cost of accuracy). But in all my attempts the response always contained "profile": "/langdetect/".

@jprante
Owner

jprante commented Mar 8, 2016

You're working so hard to find the differences between those two incarnations of the plugin... this helps a lot in aligning them!

The differences were certainly never intended; the codebases should be the same. They diverged because I focused on the "bundle" for a more comprehensive installation in my production environment, leaving the "langdetect-plugin" a bit behind. I got some internal feedback for the "bundle" that never made it back to the other version. Sorry for the mess.

BTW there are also some junit tests missing in the "bundle" which are present in "langdetect-plugin".

@marbleman
Author

Would have saved a lot of time if I had had the idea to compare the two code folders earlier... ;)

Which codebase is intended to be the Master? I guess the single plugins since they carry the most detailed documentation, right?

I am not a Java developer by nature, so it will take me a lot of effort and time to set up a functional development environment for all this stuff. I promise I will do it some time ;) Maybe you have a good tip for a starting point/howto. I wrote the unit test mentioned above against the PHP implementation, though.

So for now all I can offer is to help with the docs and testing. Is there a way to get notifications on changes, similar to code reviews? This would help to check immediately when implementation and documentation go out of sync. I would rather invest the time here, where everyone benefits, than spend hours reverse engineering issues like the one above... ;)

@jprante
Owner

jprante commented Mar 9, 2016

I see you are investing a lot of your time into langdetect right now, so I will align both codebases in the next few hours, in the hope that I can clear up the mess a bit. There are parts in both which belong to the current state.

Watching a github project should give you notifications about commits, but I'm not sure :(

@marbleman
Author

I'll give it a try. Let me know if I can be of any help.

@marbleman
Author

BTW: I figured out that reducing the languages to test as described above will lead to wrong results instead of no result or at least a low probability:

E.g. I limit detection to de,en and send in a French text. The result gives me "en" with a probability of 0.99!
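
One possible explanation, sketched below in Python, is that the detector normalizes its scores over the configured candidate languages only, so the reported probabilities always sum to 1 within that set. This is an assumption about the scoring, not something confirmed in this thread, and the raw scores are made-up numbers chosen purely for illustration.

```python
# Illustrative only: made-up raw likelihoods for one French input text.
# Assumption: scores are normalized over the configured languages only.
raw_scores = {"fr": 1e-3, "en": 1e-7, "de": 1e-9}


def normalized(scores, allowed):
    """Keep only the allowed languages and rescale their scores to sum to 1."""
    candidates = {lang: s for lang, s in scores.items() if lang in allowed}
    total = sum(candidates.values())
    return {lang: s / total for lang, s in candidates.items()}


# With all three languages available, French clearly wins.
all_langs = normalized(raw_scores, {"fr", "en", "de"})
# With detection limited to de,en, English takes almost all of the mass,
# although its raw score is tiny compared to the French one.
restricted = normalized(raw_scores, {"en", "de"})

print(all_langs)
print(restricted)
```

If this is what happens, a high probability only means "most likely among the configured languages", so a threshold on the probability alone cannot reject text in a language outside that set.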

@jprante
Owner

jprante commented Mar 9, 2016

First commit is here in my alignment quest.

jprante/elasticsearch-langdetect@ba72272

Plugin bundle will follow.

@jprante
Owner

jprante commented Mar 9, 2016

So here is the second commit to align both langdetect codebases

48b27ba

@jprante
Owner

jprante commented Mar 9, 2016

And another one

jprante/elasticsearch-langdetect@4eaff49

@marbleman
Author

Just came across something that confuses me: Thought you had mentioned you wanted to go for ISO-639-1 codes in langdetect (de, en... instead of ger,eng) ?

Current bundle 2.2.0.3 returns ger, eng...
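
Until the bundle returns two-letter codes, a client could normalize the result itself. The following Python sketch maps a few ISO 639-2/B codes to ISO 639-1. The function name `to_iso639_1` and the idea of doing this on the client side are illustrative assumptions, and the table covers only a handful of languages.

```python
# Partial mapping from ISO 639-2/B (three letters) to ISO 639-1 (two letters).
# Only a few languages are listed; extend as needed.
ISO639_2B_TO_1 = {
    "ger": "de",
    "eng": "en",
    "fre": "fr",
    "ita": "it",
    "spa": "es",
    "rus": "ru",
}


def to_iso639_1(code):
    """Return the two-letter code for `code`, or `code` itself if unknown."""
    code = code.strip().lower()
    if len(code) == 2:
        # Already looks like an ISO 639-1 code, leave it alone.
        return code
    return ISO639_2B_TO_1.get(code, code)


print([to_iso639_1(c) for c in ["ger", "eng", "de", "xxx"]])
```

Unknown codes are passed through unchanged, so the caller can still see what the plugin actually returned.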

@marbleman
Author

Oh, and I stumbled over some details regarding limiting the detected languages in the yml that could use some extra documentation, but the intention is still a bit unclear to me: I limited detection to de, en because detecting all languages takes too much time. Now I send in a Russian text and get a probability of 0.999xxx for either de or en. I would expect a much lower probability or even an empty result instead. Am I wrong?

@marbleman
Author

I drilled down on the ISO issue and figured out that the repo already contains a language.json with de/en... while elasticsearch-plugin-bundle-2.2.0.3-plugin.zip still contains the old one with ger/eng...
