Updated for Elasticsearch Version 6.4+ incl. Caching #17

Open: wants to merge 15 commits into base: master
44 changes: 22 additions & 22 deletions README.md
@@ -8,26 +8,9 @@ Currently, only baseforms for german and english are implemented.

Example: the german base form of `zurückgezogen` is `zurückziehen`.

## Versions

| Plugin | Elasticsearch | Release date |
| --------- | --------------- | -------------|
| 2.2.1.1 | 2.2.1 | Jun 22, 2016 |
| 2.2.1.0 | 2.2.1 | Apr 23, 2016 |
| 1.4.0.0 | 1.4.0 | Feb 19, 2015 |
| 1.3.0.0 | 1.3.1 | Jul 30, 2014 |

## Installation

### Elasticsearch 2.x

./bin/plugin install http://xbib.org/repository/org/xbib/elasticsearch/plugin/elasticsearch-analysis-baseform/2.2.1.1/elasticsearch-analysis-baseform-2.2.1.1-plugin.zip

### Elasticsearch 1.x

./bin/plugin -install analysis-baseform -url http://xbib.org/repository/org/xbib/elasticsearch/plugin/elasticsearch-analysis-baseform/1.4.0.0/elasticsearch-analysis-baseform-1.4.0.0-plugin.zip

Do not forget to restart the node after installing.
Use Gradle to build the plugin and install it using the elasticsearch-plugin command. Check the "gradle.properties" for the supported version.

## Project docs

@@ -61,11 +44,14 @@ In the settings, set up a token filter of type "baseform" and language "de"::
}

By using such a tokenizer, the sentence
"Die Jahresfeier der Rechtsanwaltskanzleien auf dem Donaudampfschiff hat viel Ökosteuer gekostet"

"Die Jahresfeier der Rechtsanwaltskanzleien auf dem Donaudampfschiff hat viel Ökosteuer gekostet"

will be tokenized into
"Die", "Die", "Jahresfeier", "Jahresfeier", "der", "der", "Rechtsanwaltskanzleien", "Rechtsanwaltskanzlei",
"auf", "auf", "dem", "der", "Donaudampfschiff", "Donaudampfschiff", "hat", "haben", "viel", "viel",
"Ökosteuer", "Ökosteuer", "gekostet", "kosten"

"Die", "Die", "Jahresfeier", "Jahresfeier", "der", "der", "Rechtsanwaltskanzleien", "Rechtsanwaltskanzlei",
"auf", "auf", "dem", "der", "Donaudampfschiff", "Donaudampfschiff", "hat", "haben", "viel", "viel",
"Ökosteuer", "Ökosteuer", "gekostet", "kosten"

It is recommended to add the [Unique token filter](http://www.elasticsearch.org/guide/reference/index-modules/analysis/unique-tokenfilter.html) to skip tokens that occur more than once.
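To illustrate why the recommendation above helps, here is a minimal Java sketch of order-preserving de-duplication over a token list. It is not the Unique token filter's implementation, and `UniqueTokensSketch` and `unique` are hypothetical names chosen for this example; it only mimics the effect of dropping tokens that were already seen.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

// Hypothetical helper, for illustration only: mimics the effect of removing
// tokens that occur more than once while keeping first-occurrence order.
final class UniqueTokensSketch {

    static List<String> unique(List<String> tokens) {
        // LinkedHashSet drops repeats and preserves insertion order.
        return new ArrayList<>(new LinkedHashSet<>(tokens));
    }
}
```

Applied to the 22 tokens listed above, this would leave 14 distinct tokens: "Die", "Jahresfeier", "der", "Rechtsanwaltskanzleien", "Rechtsanwaltskanzlei", "auf", "dem", "Donaudampfschiff", "hat", "haben", "viel", "Ökosteuer", "gekostet", "kosten".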

@@ -115,6 +101,18 @@ this token stream will be produced::

As an alternative, separate dictionaries for `en-verbs` and `en-nouns` are available.

## Caching

Baseform computation can increase overall indexing time drastically when it is applied to billions of tokens. To reduce this cost, a cache maps each token to its array of baseform tokens, and you can configure the cache size (in number of entries).
When the cache reaches its size limit, it is cleared and starts anew. Both the setting and the cache apply per node, so configure it in the `elasticsearch.yml` file:

```
# default: 8388608 entries
# minimum: 131072 entries
# baseform_max_cache_size: 8388608
```
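
The clear-on-full behaviour described above can be sketched roughly as follows. This is a simplified, single-threaded illustration and not the plugin's actual code: the class `BaseformCacheSketch`, its method names, and the `compute` callback are assumptions made for this example; only the default size is taken from the setting shown above.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of a token -> baseforms cache that is cleared
// completely once it reaches its configured maximum number of entries.
final class BaseformCacheSketch {

    // Default taken from the README setting above.
    static final int DEFAULT_MAX_ENTRIES = 8_388_608;

    private final int maxEntries;
    private final Map<String, List<String>> cache = new HashMap<>();

    BaseformCacheSketch(int maxEntries) {
        this.maxEntries = maxEntries;
    }

    // Returns the cached baseforms for a token, computing them on a miss.
    List<String> lookup(String token, Function<String, List<String>> compute) {
        List<String> cached = cache.get(token);
        if (cached != null) {
            return cached;
        }
        List<String> baseforms = compute.apply(token);
        if (cache.size() >= maxEntries) {
            // Size limit reached: drop everything and start anew.
            cache.clear();
        }
        cache.put(token, baseforms);
        return baseforms;
    }

    int size() {
        return cache.size();
    }
}
```

Clearing the whole map keeps the bookkeeping trivial compared with per-entry eviction such as LRU, at the price of a burst of recomputation right after each clear.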


# License

Elasticsearch Baseform Analysis Plugin
@@ -148,3 +146,5 @@ and is distributed under CC-BY-SA http://creativecommons.org/licenses/by-sa/3.0/
The english baseforms are a modified version of the english.dict file
of http://languagetool.org/download/snapshots/LanguageTool-20131115-snapshot.zip
which is licensed under LGPL http://www.fsf.org/licensing/licenses/lgpl.html#SEC1

GBI-Genios Deutsche Wirtschaftsdatenbank GmbH for adding the caching functionality.
129 changes: 58 additions & 71 deletions build.gradle
@@ -1,60 +1,40 @@
group = 'org.xbib.elasticsearch.plugin'
version = '2.2.0.0'

ext {
pluginName = 'baseform'
pluginClassname = 'org.xbib.elasticsearch.plugin.baseform.AnalysisBaseformPlugin'
pluginDescription = 'Baseform plugin for Elasticsearch'
user = 'jprante'
name = 'elasticsearch-analysis-baseform'
scmUrl = 'https://github.com/' + user + '/' + name
scmConnection = 'scm:git:git://github.com/' + user + '/' + name + '.git'
scmDeveloperConnection = 'scm:git:git://github.com/' + user + '/' + name + '.git'
versions = [
'elasticsearch' : '2.2.0',
'log4j': '2.5',
'junit' : '4.12'
]
}

println "Host: " + java.net.InetAddress.getLocalHost()
println "Gradle: " + gradle.gradleVersion + " JVM: " + org.gradle.internal.jvm.Jvm.current() + " Groovy: " + GroovySystem.getVersion()
println "Build: group: '${project.group}', name: '${project.name}', version: '${project.version}'"
println "Timestamp: " + java.time.Instant.now().atZone(java.time.ZoneId.systemDefault()).toString()

buildscript {
repositories {
mavenLocal()
mavenCentral()
jcenter()
maven {
url "http://xbib.org/repository"
}
}
dependencies {
classpath 'org.ajoberstar:gradle-git:1.4.2'
classpath 'co.riiid:gradle-github-plugin:0.4.2'
classpath 'io.codearte.gradle.nexus:gradle-nexus-staging-plugin:0.5.3'
}
plugins {
id "org.sonarqube" version "2.5"
id "org.xbib.gradle.plugin.asciidoctor" version "1.5.4.1.0"
id "io.codearte.nexus-staging" version "0.7.0"
}


printf "Host: %s\nOS: %s %s %s\nJVM: %s %s %s %s\nGroovy: %s\nGradle: %s\n" +
"Build: group: ${project.group} name: ${project.name} version: ${project.version}\n",
InetAddress.getLocalHost(),
System.getProperty("os.name"),
System.getProperty("os.arch"),
System.getProperty("os.version"),
System.getProperty("java.version"),
System.getProperty("java.vm.version"),
System.getProperty("java.vm.vendor"),
System.getProperty("java.vm.name"),
GroovySystem.getVersion(),
gradle.gradleVersion

apply plugin: 'java'
apply plugin: 'maven'
apply plugin: 'signing'
apply plugin: 'co.riiid.gradle'
apply plugin: 'findbugs'
apply plugin: 'pmd'
apply plugin: 'checkstyle'
apply plugin: "jacoco"
apply plugin: 'org.xbib.gradle.plugin.asciidoctor'

repositories {
mavenCentral()
mavenLocal()
jcenter()
maven {
url "http://xbib.org/repository"
}
}

configurations {
wagon
releaseJars {
distJars {
extendsFrom runtime
exclude group: 'org.elasticsearch'
exclude module: 'lucene-core'
@@ -63,27 +43,41 @@ configurations {
exclude module: 'jackson-core'
exclude module: 'jackson-dataformat-smile'
exclude module: 'jackson-dataformat-yaml'
exclude module: 'log4j-api'
}
}

apply from: 'gradle/ext.gradle'
apply from: 'gradle/publish.gradle'
apply from: 'gradle/sonarqube.gradle'

dependencies {
compile "org.elasticsearch:elasticsearch:${versions.elasticsearch}"
testCompile "junit:junit:${versions.junit}"
testCompile "org.apache.logging.log4j:log4j-slf4j-impl:${versions.log4j}"
testCompile "org.apache.logging.log4j:log4j-core:${versions.log4j}"
releaseJars "${project.group}:${project.name}:${project.version}"
wagon 'org.apache.maven.wagon:wagon-ssh-external:2.10'
def without_hamcrest = {
exclude group: 'org.hamcrest', module: 'hamcrest-core'
}
compile "org.elasticsearch:elasticsearch:${project.property('elasticsearch.version')}"
compile "org.apache.logging.log4j:log4j-api:${project.property('log4j.version')}"
testCompile "junit:junit:${project.property('junit.version')}", without_hamcrest
testCompile "org.apache.logging.log4j:log4j-core:${project.property('log4j.version')}"
testCompile "org.elasticsearch.plugin:transport-netty4-client:${project.property('elasticsearch.version')}"
testCompile "org.elasticsearch.test:framework:${project.property('elasticsearch.version')}"
testCompile "org.codelibs.elasticsearch.module:analysis-common:${project.property('elasticsearch.version')}"
distJars "${project.group}:${project.name}:${project.version}"
wagon "org.apache.maven.wagon:wagon-ssh:${project.property('wagon.version')}"
}

sourceCompatibility = 1.7
targetCompatibility = 1.7
sourceCompatibility = JavaVersion.VERSION_1_8
targetCompatibility = JavaVersion.VERSION_1_8

[compileJava, compileTestJava]*.options*.encoding = 'UTF-8'
tasks.withType(JavaCompile) {
options.compilerArgs << "-Xlint:unchecked,deprecation"
options.compilerArgs << "-Xlint:all" << "-profile" << "compact2"
}

test {
systemProperty 'path.home', projectDir.absolutePath
systemProperties['path.home'] = System.getProperty("user.dir")
systemProperties['tests.security.manager'] = false

testLogging {
showStandardStreams = false
exceptionFormat = 'full'
@@ -95,28 +89,26 @@ task makePluginDescriptor(type: Copy) {
into 'build/tmp/plugin'
expand([
'descriptor': [
'name': pluginName,
'classname': pluginClassname,
'description': pluginDescription,
'jvm': true,
'site': false,
'isolated': true,
'version': project.property('version'),
'javaVersion': project.property('targetCompatibility'),
'elasticsearchVersion' : versions.elasticsearch
'name': project.property('pluginName'),
'classname': project.property('pluginClassname'),
'description': project.property('pluginDescription'),
'version': project.property('version'),
'javaVersion': project.property('targetCompatibility'),
'elasticsearchVersion' : project.property('elasticsearch.version')
]
])
}

task buildPluginZip(type: Zip, dependsOn: [':jar', ':makePluginDescriptor']) {
from configurations.releaseJars
from configurations.distJars
from 'build/tmp/plugin'
//into 'elasticsearch'
classifier = 'plugin'
}

task unpackPlugin(type: Copy, dependsOn: [':buildPluginZip']) {
delete "plugins"
from configurations.releaseJars
from configurations.distJars
from 'build/tmp/plugin'
into "plugins/${pluginName}"
}
@@ -140,16 +132,11 @@ task sourcesJar(type: Jar, dependsOn: classes) {
}

artifacts {
archives javadocJar, sourcesJar, buildPluginZip
archives sourcesJar, javadocJar, buildPluginZip
}

if (project.hasProperty('signing.keyId')) {
signing {
sign configurations.archives
}
}

ext.grgit = org.ajoberstar.grgit.Grgit.open()

apply from: 'gradle/git.gradle'
apply from: 'gradle/publish.gradle'