baseform: less word forms returned than defined in the resource #31

Open
nkrot opened this issue Apr 6, 2017 · 2 comments


nkrot commented Apr 6, 2017

Situation: The baseform resource de-lemma-utf8.txt defines several outcomes for a single input word, for example:

Zuschlage	Zuschlag
Zuschlage	zuschlagen

I would expect all outcomes to be returned, since the correct baseform depends on the part of speech.

If the resource is used case-insensitively, the number of such collisions will increase, now comprising cases like:

Gefahren	Gefahr
gefahren	fahren

Would it be possible to fix the plugin to return all entries given in the resource?
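
To make the collision concrete, here is a minimal Python sketch. It is not the plugin's code: the function name `build_index`, the `ignore_case` flag, and the in-memory `SAMPLE` string are hypothetical, and the sketch assumes the resource is a tab-separated "form, baseform" list as in the lines quoted above.

```python
from collections import defaultdict

# Sample entries in the tab-separated "form<TAB>baseform" layout quoted above.
# (Assumption: the real resource uses a single tab as the separator.)
SAMPLE = (
    "Zuschlage\tZuschlag\n"
    "Zuschlage\tzuschlagen\n"
    "Gefahren\tGefahr\n"
    "gefahren\tfahren\n"
)


def build_index(text, ignore_case=False):
    """Map each word form to the list of all baseforms listed for it."""
    index = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        form, base = line.split("\t", 1)
        key = form.lower() if ignore_case else form
        if base not in index[key]:
            index[key].append(base)  # keep every outcome, in file order
    return dict(index)


exact = build_index(SAMPLE)
folded = build_index(SAMPLE, ignore_case=True)

print(exact["Zuschlage"])   # both baseforms survive for one exact form
print(exact["Gefahren"])    # case-sensitive: a single baseform
print(folded["gefahren"])   # case-folded: the two entries now collide
```

With exact matching only "Zuschlage" is ambiguous; once keys are case-folded, "Gefahren" and "gefahren" share one key and both "Gefahr" and "fahren" have to be returned.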

Thanks

jprante (Owner) commented Apr 6, 2017

That's a bug: in the left column of de-lemma-utf8.txt, every word should occur at most once.

Part-of-speech disambiguation is out of scope for the baseform token filter. It would require WordNet-like input together with an NLP plugin for POS tagging.

nkrot (Author) commented Apr 6, 2017

Hopefully you agree that a single word form can map to one or more baseforms. This is the main point of my initial post: if no PoS information is available, it is reasonable to assume any PoS and produce all possible baseforms. Here is an example of two different lemmata that share the same derived form:

leaves       leaf
leaves       leave

If the left column is supposed to contain unique words only, how should multiple outcomes be given? Like this:

Zuschlage     Zuschlag,zuschlagen

It is also possible to do this merging at load/compile time. That would make things a little easier for users who want to update the resource.
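
A load-time merge of this kind could look roughly like the following Python sketch. The function name `merge_entries` is hypothetical, and the comma-separated output format is just the one proposed above, not something the plugin is known to accept.

```python
def merge_entries(lines):
    """Collapse repeated word forms into one 'form<TAB>base1,base2' line each."""
    merged = {}  # plain dicts preserve insertion order (Python 3.7+)
    for line in lines:
        parts = line.split()  # accepts tabs or runs of spaces as separator
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        form, base = parts
        bases = merged.setdefault(form, [])
        if base not in bases:
            bases.append(base)  # keep each baseform once, in file order
    return [f"{form}\t{','.join(bases)}" for form, bases in merged.items()]


resource = [
    "leaves       leaf",
    "leaves       leave",
    "Zuschlage\tZuschlag",
    "Zuschlage\tzuschlagen",
]
for entry in merge_entries(resource):
    print(entry)
```

Users could then keep editing the resource one pair per line, and the unique-left-column form would be produced by the loader.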
