Document already exist (http 409) cases document improvements. #16117

Closed
mashhurs opened this issue May 1, 2024 · 3 comments · May be fixed by #16402
mashhurs commented May 1, 2024

Tell us about the issue

Description:
There are various situations where ES may reject an event with a "document already exists" error. The purpose of this issue is to collect such cases and add short documentation (under whichever place is suitable: the troubleshooting docs, support docs, or es-output docs), as we are getting the same question over and over.

  • Ingestion from agent
    Possibly two cases I can think of now:
    • Events are in a data stream from an integration, and the integration has a fingerprint processor which sets the document _id. For example, the tenable_sc integration may have logs-tenable_sc.vulnerability-{version} & logs-tenable_sc.plugin-{version} ingest pipelines which include a fingerprint processor that sets the _id:
    {
     "fingerprint": {
       "fields": [
         "json.lastSeen",
         "json.pluginID",
         "json.ip",
         "json.uuid",
         "json.firstSeen",
         "json.lastSeen",
         "json.exploitAvailable",
         "json.vulnPubDate",
         "json.patchPubDate",
         "json.pluginPubDate",
         "json.pluginModDate",
         "json.pluginText",
         "json.dnsName",
         "json.macAddress",
         "json.operatingSystem",
         "json.pluginInfo"
       ],
       "target_field": "_id",
       "ignore_missing": true
     }
    },
    

Example log when Logstash receives a rejected event:

 [2024-04-24T14:13:25,988][WARN ][logstash.outputs.elasticsearch][Elastic-Agent-to-Logstash].[6ef18a6008d3cea8f01e0cd409c22213845cff0829c5f76b02d18a30a22c024d] Failed action {:status=>409, :action=>["create", {:_id=>nil, :_index=>"metrics-windows.service-default", :routing=>nil}, {"host"=>{"mac"=>["...", "..."], "name"=>"redacted-name", "ip"=>["1", "2", "3"], "architecture"=>"x86_64", "id"=>"aae15", "os"=>{"name"=>"Windows Server 2016 Datacenter", "platform"=>"windows", "type"=>"windows", "kernel"=>"10.0.14393.6897 (rs1_release.240404-1613)", "family"=>"windows", "build"=>"14393.6897", "version"=>"10.0"}, "hostname"=>"redacted-host"}, "service"=>{"type"=>"windows"}, "elastic_agent"=>{"version"=>"8.11.3", "id"=>"48423d1c-5a87-46ef-b6a8-baf90b515e63", "snapshot"=>false}, "metricset"=>{"name"=>"service", "period"=>60000}, "event"=>{"duration"=>211345500, "module"=>"windows", "dataset"=>"windows.service"}, "cloud"=>{"service"=>{"name"=>"redacted-name"}, "instance"=>{"id"=>"redacted-id", "name"=>"redacted-name"}, "provider"=>"openstack", "machine"=>{"type"=>"t"}, "availability_zone"=>"zone"}, "@timestamp"=>2024-04-24T14:12:24.991Z, ...., "type"=>"metricbeat", "id"=>"48423d1c-5a87-46ef-b6a8-baf90b515e63", "ephemeral_id"=>"fdd26638-3865-4e63-961a-69d8f419538a", "version"=>"8.11.3"}, "@version"=>"1"}], :response=>{"create"=>{"status"=>409, "error"=>{"type"=>"version_conflict_engine_exception", "reason"=>"[abcd][{agent.id=my-agent-id, cloud.availability_zone=zone, cloud.instance.id=i-id, windows.service.pid=996, windows.service.state=Running}@2024-04-24T14:12:24.991Z]: version conflict, document already exists (current version [1])", "index_uuid"=>"wBkYL16ZSo2fJ7tAC3zbbw", "shard"=>"0", "index"=>".ds-metrics-windows.service-default-2024.04.22-000065"}}}}
  • Logstash experiences backpressure and cannot acknowledge the events back to the agent; as a result, the agent times out and resends the events. In reality, the events might already be indexed in ES. A quick resolution would be extending the agent timeout, but it may depend on the situation.

  • etc.
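A Logstash-side analogue of this pattern can be sketched with the fingerprint filter; the field names and settings below are illustrative, not taken from the integration:

    filter {
      # Hypothetical fields; any stable set of event fields works.
      fingerprint {
        source => ["[json][pluginID]", "[json][ip]", "[json][uuid]"]
        target => "[@metadata][fingerprint]"
        method => "SHA256"
        concatenate_sources => true
      }
    }
    output {
      elasticsearch {
        # Two events with identical source field values hash to the same
        # _id, so the second `create` is rejected with a 409.
        document_id => "%{[@metadata][fingerprint]}"
      }
    }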

URL:

Example: https://www.elastic.co/guide/en/logstash/current/introduction.html

Anything else?

robbavey commented May 3, 2024

There is another potential cause for these 409 conflicts.

When integrations write to a TSDS-enabled index, the document id is defined as "a hash of the document’s dimensions and @timestamp".

The document's dimensions are defined in the integration. When events are sent at a frequency greater than one per millisecond, and the dimensions are insufficient to disambiguate those events, a version conflict will arise.
This has already been seen in the integrations for the elastic agent and mysql, and I suspect there are more that can cause the issue.

@mashhurs (Contributor, Author)

I created a doc for the fingerprint case, but I'm not sure whether we have a recommended resolution for the TSDS case: https://github.com/mashhurs/logstash/blob/docs-409-issue/docs/static/troubleshoot/ts-elasticsearch.asciidoc

@mashhurs mashhurs linked a pull request Aug 27, 2024 that will close this issue
@mashhurs (Contributor, Author)

Mostly, we get document-level HTTP 409 errors in the following cases:

1. Customers are manually setting the document _id (not specific to agent -> LS)

They need to ensure _id uniqueness. If you see a document_id setting (perhaps with a dynamic sprintf style like "%{[@metadata][id]}") in the es-output plugin of the pipeline configs, this is most likely the cause. You can double-confirm this if you see an exact doc ID in the log: {:_id=>supposed-to-be-unique-id, ....
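As a sketch (the metadata field name here is hypothetical), such a pipeline typically looks like:

    output {
      elasticsearch {
        # If two events carry the same [@metadata][id] value, the second
        # write is rejected with a 409 version_conflict_engine_exception.
        document_id => "%{[@metadata][id]}"
      }
    }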

2. [elastic-agent -> LS] Logstash is experiencing backpressure

When {ls} faces backpressure, it cannot acknowledge the events back to elastic-agent; as a result, the agent times out and resends the events.
However, the previously sent events may already be (fully or partially) indexed in ES.
On the agent side, extending the timeout in the elastic-agent configuration may improve the situation.
However, it is highly recommended to
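The agent-side timeout can be raised in the agent policy's advanced YAML for the Logstash output. This assumes the Beats-style timeout setting is honored by your agent version, so verify it against the docs for your release; host and value below are placeholders:

    outputs:
      default:
        type: logstash
        hosts: ["your-logstash-host:5044"]
        # Hypothetical value: extend how long the agent waits for an
        # acknowledgement before it gives up and resends the batch.
        timeout: 90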

3. [elastic-agent -> LS] Time series data stream (TSDS) based integration

The _id for TSDS documents is a hash value, calculated by Elasticsearch based on the https://www.elastic.co/guide/en/elasticsearch/reference/current/tsds.html#time-series-dimension[document's dimensions] and @timestamp. Depending on dimension granularity and the frequency at which ES receives documents, the _id might not be unique.
Per the TSDS guide, you can identify whether an index is time_series by looking at its index template (index.mode: time_series). In the index template, you can also see which fields are dimensions ("time_series_dimension": true). You may need to analyze whether the dimensions + @timestamp combination is actually unique.
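For illustration, the relevant parts of a TSDS index template look roughly like this (the dimension field name is hypothetical):

    {
      "template": {
        "settings": {
          "index.mode": "time_series"
        },
        "mappings": {
          "properties": {
            "host.name": {
              "type": "keyword",
              "time_series_dimension": true
            }
          }
        }
      }
    }

Two documents whose dimension field values and @timestamp all match will hash to the same _id.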

With this symptom, I haven't seen the _id explicitly set; it is mostly nil (see the error example below). A freshly reproduced example of the error:

[2024-08-26T15:11:43,587][WARN ][logstash.outputs.elasticsearch][main][5da2de1a6015f56bca61a3878c1e397ea47ce4da8031f78e07087f62e900c6c6] Failed action {:status=>409, :action=>["create", {:_id=>nil, :_index=>"metrics-system.network-default", :routing=>nil}, {"@timestamp"=>2024-08-26T21:48:48.268Z, "agent"=>{"id"=>"95e1ef4e-6e7a-4cfd-88e8-98df506ad305", "type"=>"metricbeat", "ephemeral_id"=>"9f7fc9b8-a93c-4252-8312-41668e9bb310", "name"=>"mashhurs-host", "version"=>"8.14.1"}, "system"=>{"network"=>{"out"=>{"packets"=>1, "dropped"=>0, "bytes"=>100, "errors"=>0}, "name"=>"utun0", "in"=>{"packets"=>0, "dropped"=>0, "bytes"=>0, "errors"=>0}}}, "service"=>{"type"=>"system"}, "metricset"=>{"name"=>"network", "period"=>1000}, "ecs"=>{"version"=>"8.0.0"}, "data_stream"=>{"namespace"=>"default", "type"=>"metrics", "dataset"=>"system.network"}, "tags"=>["beats_input_raw_event"], "event"=>{"duration"=>30177083, "dataset"=>"system.network", "module"=>"system"}, "@version"=>"1", "elastic_agent"=>{"id"=>"95e1ef4e-6e7a-4cfd-88e8-....", "snapshot"=>false, "version"=>"8.14.1"}}], :response=>{"create"=>{"status"=>409, "error"=>{"type"=>"version_conflict_engine_exception", "reason"=>"[X2S_eqC1kCpFpb1lAAABkZCpuww][LI_yqguJnd4yG2gWc-sq5af-yVYaXIwj-eRIWXnMt35ym4EgBCfmRK6iivMF@2024-08-26T21:48:48.268Z]: version conflict, document already exists (current version [1])", "index_uuid"=>"YJE1kdgbTvuNeS23sbDVQg", "shard"=>"0", "index"=>".ds-metrics-system.network-default-2024.08.26-000003"}}}}

4. When the error can be ignored
The es-output has a silence_errors_in_log option that is specifically useful for these situations. Since the doc is already indexed, it is up to your preference to ignore the error.
Example config:

    output {
      elasticsearch {
        silence_errors_in_log => ["version_conflict_engine_exception"]
      }
    }
