Document already exist (http 409) cases document improvements. #16117

Closed
mashhurs opened this issue May 1, 2024 · 3 comments · May be fixed by #16402
mashhurs commented May 1, 2024

Tell us about the issue

Description:
There are various situations where ES may reject an event with a "document already exists" error. The purpose of this issue is to collect such cases and add short documentation (under whichever place is suitable: the troubleshooting docs, support docs, or es-output docs), as we are getting the same question over and over.

  • Ingestion from agent
    Possibly two cases I can think of now:
    • Events are in a data stream from an integration, and the integration has a fingerprint processor which sets the document _id. For example, the tenable_sc integration may have logs-tenable_sc.vulnerability-{version} & logs-tenable_sc.plugin-{version} ingest pipelines which include a fingerprint processor that sets the _id:
    {
     "fingerprint": {
       "fields": [
         "json.lastSeen",
         "json.pluginID",
         "json.ip",
         "json.uuid",
         "json.firstSeen",
         "json.lastSeen",
         "json.exploitAvailable",
         "json.vulnPubDate",
         "json.patchPubDate",
         "json.pluginPubDate",
         "json.pluginModDate",
         "json.pluginText",
         "json.dnsName",
         "json.macAddress",
         "json.operatingSystem",
         "json.pluginInfo"
       ],
       "target_field": "_id",
       "ignore_missing": true
     }
    },
    

Example log when Logstash receives a rejected event:

 [2024-04-24T14:13:25,988][WARN ][logstash.outputs.elasticsearch][Elastic-Agent-to-Logstash].[6ef18a6008d3cea8f01e0cd409c22213845cff0829c5f76b02d18a30a22c024d] Failed action {:status=>409, :action=>["create", {:_id=>nil, :_index=>"metrics-windows.service-default", :routing=>nil}, {"host"=>{"mac"=>["...", "..."], "name"=>"redacted-name", "ip"=>["1", "2", "3"], "architecture"=>"x86_64", "id"=>"aae15", "os"=>{"name"=>"Windows Server 2016 Datacenter", "platform"=>"windows", "type"=>"windows", "kernel"=>"10.0.14393.6897 (rs1_release.240404-1613)", "family"=>"windows", "build"=>"14393.6897", "version"=>"10.0"}, "hostname"=>"redacted-host"}, "service"=>{"type"=>"windows"}, "elastic_agent"=>{"version"=>"8.11.3", "id"=>"48423d1c-5a87-46ef-b6a8-baf90b515e63", "snapshot"=>false}, "metricset"=>{"name"=>"service", "period"=>60000}, "event"=>{"duration"=>211345500, "module"=>"windows", "dataset"=>"windows.service"}, "cloud"=>{"service"=>{"name"=>"redacted-name"}, "instance"=>{"id"=>"redacted-id", "name"=>"redacted-name"}, "provider"=>"openstack", "machine"=>{"type"=>"t"}, "availability_zone"=>"zone"}, "@timestamp"=>2024-04-24T14:12:24.991Z, ...., "type"=>"metricbeat", "id"=>"48423d1c-5a87-46ef-b6a8-baf90b515e63", "ephemeral_id"=>"fdd26638-3865-4e63-961a-69d8f419538a", "version"=>"8.11.3"}, "@version"=>"1"}], :response=>{"create"=>{"status"=>409, "error"=>{"type"=>"version_conflict_engine_exception", "reason"=>"[abcd][{agent.id=my-agent-id, cloud.availability_zone=zone, cloud.instance.id=i-id, windows.service.pid=996, windows.service.state=Running}@2024-04-24T14:12:24.991Z]: version conflict, document already exists (current version [1])", "index_uuid"=>"wBkYL16ZSo2fJ7tAC3zbbw", "shard"=>"0", "index"=>".ds-metrics-windows.service-default-2024.04.22-000065"}}}}
  • Logstash experiences backpressure and cannot acknowledge the events back to the agent; as a result, the agent times out and resends the events. In reality, the events might already be indexed in ES. A quick resolution would be extending the agent timeout, but it may depend on the situation.

  • etc.
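A Logstash-side analogue of this pattern can be sketched with the fingerprint filter; the field names and settings below are illustrative, not taken from the integration:

    filter {
      # Hypothetical fields; any stable set of event fields works.
      fingerprint {
        source => ["[json][pluginID]", "[json][ip]", "[json][uuid]"]
        target => "[@metadata][fingerprint]"
        method => "SHA256"
        concatenate_sources => true
      }
    }
    output {
      elasticsearch {
        # Two events with identical source field values hash to the same
        # _id, so the second `create` is rejected with a 409.
        document_id => "%{[@metadata][fingerprint]}"
      }
    }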

URL:

Example: https://www.elastic.co/guide/en/logstash/current/introduction.html

Anything else?

robbavey commented May 3, 2024

There is another potential cause for these 409 conflicts.

When integrations write to a TSDS-enabled index, the document id is defined as "a hash of the document’s dimensions and @timestamp".

The document's dimensions are defined in the integration. When events are sent at a frequency greater than one per millisecond, and the dimensions are insufficient to disambiguate those events, a version conflict will arise.
This has already been seen in the integrations for the elastic agent and mysql, and I suspect there are more that can cause the issue.

@mashhurs (Contributor, Author)

I created a doc for the fingerprint case, but I'm not sure whether we have a recommended resolution for the TSDS case: https://github.com/mashhurs/logstash/blob/docs-409-issue/docs/static/troubleshoot/ts-elasticsearch.asciidoc

@mashhurs mashhurs linked a pull request Aug 27, 2024 that will close this issue
@mashhurs (Contributor, Author)

Mostly, we get document-level HTTP 409 errors in the following cases:

1. Customers are manually setting the document _id (not specific to agent -> LS)

They need to ensure _id uniqueness. If you see a document_id setting (perhaps with a dynamic sprintf style like "%{[@metadata][id]}") in the es-output plugin of the pipeline configs, this is most likely the cause. You can double-confirm this if you see an exact doc ID in the log: {:_id=>supposed-to-be-unique-id, ....
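As a sketch (the metadata field name here is hypothetical), such a pipeline typically looks like:

    output {
      elasticsearch {
        # If two events carry the same [@metadata][id] value, the second
        # write is rejected with a 409 version_conflict_engine_exception.
        document_id => "%{[@metadata][id]}"
      }
    }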

2. [elastic-agent -> LS] Logstash is experiencing backpressure

When {ls} faces backpressure, it cannot acknowledge the events back to elastic-agent; as a result, the agent times out and resends the events.
However, the previously sent events may already be (fully or partially) indexed in ES.
On the agent side, extending the timeout in the elastic-agent configuration may improve the situation.
However, it is highly recommended to
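The agent-side timeout can be raised in the agent policy's advanced YAML for the Logstash output. This assumes the Beats-style timeout setting is honored by your agent version, so verify it against the docs for your release; host and value below are placeholders:

    outputs:
      default:
        type: logstash
        hosts: ["your-logstash-host:5044"]
        # Hypothetical value: extend how long the agent waits for an
        # acknowledgement before it gives up and resends the batch.
        timeout: 90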

3. [elastic-agent -> LS] Time series data stream (TSDS) based integration

The _id for TSDS documents is a hash value, calculated by Elasticsearch based on the https://www.elastic.co/guide/en/elasticsearch/reference/current/tsds.html#time-series-dimension[document's dimensions] and @timestamp. Depending on dimension granularity and the frequency at which ES receives documents, the _id might not be unique.
Per the TSDS guide, you can identify whether an index is time_series by looking at its index template (index.mode: time_series). In the index template, you can also see which fields are dimensions ("time_series_dimension": true). You may need to analyze whether the dimensions + @timestamp combination is actually unique.
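For illustration, the relevant parts of a TSDS index template look roughly like this (the dimension field name is hypothetical):

    {
      "template": {
        "settings": {
          "index.mode": "time_series"
        },
        "mappings": {
          "properties": {
            "host.name": {
              "type": "keyword",
              "time_series_dimension": true
            }
          }
        }
      }
    }

Two documents whose dimension field values and @timestamp all match will hash to the same _id.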

With this symptom, I haven't seen the _id explicitly set; it is mostly nil (see the error example below). A freshly reproduced example of the error:

[2024-08-26T15:11:43,587][WARN ][logstash.outputs.elasticsearch][main][5da2de1a6015f56bca61a3878c1e397ea47ce4da8031f78e07087f62e900c6c6] Failed action {:status=>409, :action=>["create", {:_id=>nil, :_index=>"metrics-system.network-default", :routing=>nil}, {"@timestamp"=>2024-08-26T21:48:48.268Z, "agent"=>{"id"=>"95e1ef4e-6e7a-4cfd-88e8-98df506ad305", "type"=>"metricbeat", "ephemeral_id"=>"9f7fc9b8-a93c-4252-8312-41668e9bb310", "name"=>"mashhurs-host", "version"=>"8.14.1"}, "system"=>{"network"=>{"out"=>{"packets"=>1, "dropped"=>0, "bytes"=>100, "errors"=>0}, "name"=>"utun0", "in"=>{"packets"=>0, "dropped"=>0, "bytes"=>0, "errors"=>0}}}, "service"=>{"type"=>"system"}, "metricset"=>{"name"=>"network", "period"=>1000}, "ecs"=>{"version"=>"8.0.0"}, "data_stream"=>{"namespace"=>"default", "type"=>"metrics", "dataset"=>"system.network"}, "tags"=>["beats_input_raw_event"], "event"=>{"duration"=>30177083, "dataset"=>"system.network", "module"=>"system"}, "@version"=>"1", "elastic_agent"=>{"id"=>"95e1ef4e-6e7a-4cfd-88e8-....", "snapshot"=>false, "version"=>"8.14.1"}}], :response=>{"create"=>{"status"=>409, "error"=>{"type"=>"version_conflict_engine_exception", "reason"=>"[X2S_eqC1kCpFpb1lAAABkZCpuww][LI_yqguJnd4yG2gWc-sq5af-yVYaXIwj-eRIWXnMt35ym4EgBCfmRK6iivMF@2024-08-26T21:48:48.268Z]: version conflict, document already exists (current version [1])", "index_uuid"=>"YJE1kdgbTvuNeS23sbDVQg", "shard"=>"0", "index"=>".ds-metrics-system.network-default-2024.08.26-000003"}}}}

4. When the error can be ignored
The es-output has a silence_errors_in_log option that is specifically useful for these situations. Since the doc is already indexed, it is up to your preference to ignore the error.
Example config:

    output {
      elasticsearch {
        silence_errors_in_log => ["version_conflict_engine_exception"]
      }
    }
