
decompound filter returns non-compound words twice #9

Open
ackermann opened this issue Dec 9, 2015 · 2 comments

Comments

@ackermann

First of all: Thanks for creating this enormously helpful bundle! While fine-tuning it for our application, I've stumbled upon the following problem: the decompound filter correctly returns the subwords of compound words, but it returns every word that is not a compound twice (i.e. it treats the non-compound word as a single subword of itself).

This is a simplified version of my index settings that reproduces the problem:

settings:
    index:
        analysis:
            analyzer:
                german_analyzer:
                    type: custom
                    tokenizer: standard
                    filter: [decompounder]
            filter:
                decompounder:
                    type: decompound

Querying /_analyze with the text Grundbuchamt Anwältin returns:

tokens:
- token: "Grundbuchamt"
  start_offset: 0
  end_offset: 12
  type: "<ALPHANUM>"
  position: 0
- token: "Grund"
  start_offset: 0
  end_offset: 12
  type: "<ALPHANUM>"
  position: 0
- token: "buch"
  start_offset: 0
  end_offset: 12
  type: "<ALPHANUM>"
  position: 0
- token: "amt"
  start_offset: 0
  end_offset: 12
  type: "<ALPHANUM>"
  position: 0
- token: "Anwältin"
  start_offset: 13
  end_offset: 21
  type: "<ALPHANUM>"
  position: 1
- token: "Anwältin"
  start_offset: 13
  end_offset: 21
  type: "<ALPHANUM>"
  position: 1

As you can see, the token Anwältin is returned twice with the same offset and position.

(Setting subwords_only to true eliminates the duplicates by the way.)
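For reference, this is roughly the _analyze call behind the output above (a sketch; the index name my_index is a placeholder, and the query-parameter form is the one used by Elasticsearch 1.x/2.x):

curl -XGET 'localhost:9200/my_index/_analyze?analyzer=german_analyzer&pretty' -d 'Grundbuchamt Anwältin'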

Do you have an idea how we might fix this behaviour?

ackermann reopened this Dec 9, 2015
@jprante
Owner

jprante commented Dec 9, 2015

There may be a flaw. As a workaround, duplicates can be removed from the token stream with the standard "unique" filter: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-unique-tokenfilter.html

@ackermann
Author

Thanks! I just came back to post the same solution. One important note: the unique filter should be used with only_on_same_position: true, because otherwise it also removes legitimate repetitions of a word elsewhere in the field and the term frequencies get heavily distorted.

As an example for others with the same problem:

settings:
    index:
        analysis:
            analyzer:
                german_analyzer:
                    type: custom
                    tokenizer: standard
                    filter: [decompounder, unique_decomp]
            filter:
                unique_decomp:
                    type: unique
                    only_on_same_position: true
                decompounder:
                    type: decompound
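For anyone applying this directly against Elasticsearch rather than through bundle configuration, the equivalent index creation request looks roughly like this (a sketch; my_index is a placeholder name):

curl -XPUT 'localhost:9200/my_index' -d '{
    "settings": {
        "analysis": {
            "analyzer": {
                "german_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["decompounder", "unique_decomp"]
                }
            },
            "filter": {
                "unique_decomp": {
                    "type": "unique",
                    "only_on_same_position": true
                },
                "decompounder": {
                    "type": "decompound"
                }
            }
        }
    }
}'

With this analyzer, re-running the _analyze request from above should return Anwältin only once, while Grundbuchamt is still split into Grund, buch and amt.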
