
decompound filter returns non-compound words twice #9

Open
ackermann opened this issue Dec 9, 2015 · 2 comments

Comments

@ackermann

First of all: Thanks for creating this enormously helpful bundle! While fine-tuning it for our application, I've stumbled upon the following problem: the decompound filter correctly returns the subwords of compound words, but it returns every word that is not a compound twice (i.e. it treats the non-compound word as a single subword of itself).

This is a simplified version of my index settings that reproduces the problem:

settings:
    index:
        analysis:
            analyzer:
                german_analyzer:
                    type: custom
                    tokenizer: standard
                    filter: [decompounder]
            filter:
                decompounder:
                    type: decompound

Querying /_analyze with the text Grundbuchamt Anwältin returns:

tokens:
- token: "Grundbuchamt"
  start_offset: 0
  end_offset: 12
  type: "<ALPHANUM>"
  position: 0
- token: "Grund"
  start_offset: 0
  end_offset: 12
  type: "<ALPHANUM>"
  position: 0
- token: "buch"
  start_offset: 0
  end_offset: 12
  type: "<ALPHANUM>"
  position: 0
- token: "amt"
  start_offset: 0
  end_offset: 12
  type: "<ALPHANUM>"
  position: 0
- token: "Anwältin"
  start_offset: 13
  end_offset: 21
  type: "<ALPHANUM>"
  position: 1
- token: "Anwältin"
  start_offset: 13
  end_offset: 21
  type: "<ALPHANUM>"
  position: 1

As you can see, the token Anwältin is returned twice with the same offset and position.

(Setting subwords_only to true eliminates the duplicates by the way.)
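For reference, this is roughly the _analyze call behind the output above (a sketch; the index name my_index is a placeholder, and the query-parameter form is the one used by Elasticsearch 1.x/2.x):

curl -XGET 'localhost:9200/my_index/_analyze?analyzer=german_analyzer&pretty' -d 'Grundbuchamt Anwältin'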

Do you have an idea how we might fix this behaviour?

ackermann reopened this Dec 9, 2015
@jprante
Owner

jprante commented Dec 9, 2015

There may be a flaw. As a workaround, duplicates can be removed from the token stream with the standard "unique" filter: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-unique-tokenfilter.html

@ackermann
Author

Thanks! I just came back to post the same solution. One important note: the unique filter should be used with only_on_same_position: true, because otherwise it also removes legitimate repetitions of a word elsewhere in the field and the term frequencies get heavily distorted.

As an example for others with the same problem:

settings:
    index:
        analysis:
            analyzer:
                german_analyzer:
                    type: custom
                    tokenizer: standard
                    filter: [decompounder, unique_decomp]
            filter:
                unique_decomp:
                    type: unique
                    only_on_same_position: true
                decompounder:
                    type: decompound
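For anyone applying this directly against Elasticsearch rather than through bundle configuration, the equivalent index creation request looks roughly like this (a sketch; my_index is a placeholder name):

curl -XPUT 'localhost:9200/my_index' -d '{
    "settings": {
        "analysis": {
            "analyzer": {
                "german_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["decompounder", "unique_decomp"]
                }
            },
            "filter": {
                "unique_decomp": {
                    "type": "unique",
                    "only_on_same_position": true
                },
                "decompounder": {
                    "type": "decompound"
                }
            }
        }
    }
}'

With this analyzer, re-running the _analyze request from above should return Anwältin only once, while Grundbuchamt is still split into Grund, buch and amt.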
