
Compactor failures #4091

Open
Riscky opened this issue Sep 17, 2024 · 3 comments

Comments


Riscky commented Sep 17, 2024

Describe the bug

We have set up a small Tempo cluster, and it’s running mostly fine.
However, we are getting compactor errors almost every compaction cycle:

googleapi: Error 404: No such object: /single-tenant/<block_id>/meta.json, notFound

The error message (msg field) is sometimes slightly different:

  • failed to mark block compacted during retention
  • unable to mark block compacted
  • failed to clear compacted block during retention

but the inner error (err field) is always the above.

Based on what we can find about these errors (#2560, #1270, https://community.grafana.com/t/cannot-find-traceids-in-s3-blocks/45423/10), it would appear that the compactor ring is not (correctly) formed.
I have also seen occurrences where two nodes were trying to compact the same block, so that checks out.
However, the ring status page looks fine (all three nodes show as active).
The ingester ring has formed and we haven't seen any problems with that.

To Reproduce

I haven't been able to reproduce this in a different environment yet.

We're using Tempo 2.5.0, with Consul as the KV store and GCS as block storage. We're running 3 nodes in scalable-single-binary mode on a Nomad cluster.

Expected behavior

No compaction errors

Environment:

  • Infrastructure: Nomad cluster

Additional Context

Our configuration, insofar as it seems relevant to the issue:

# We run Grafana Tempo in ScalableSingleBinary mode. 
target: "scalable-single-binary"

ingester:
  lifecycler:
    id: "tempo-ingester@${NOMAD_ALLOC_ID}"
    ring:
      kvstore:
        store: "consul"
        prefix: "tempo/ingesters/"
        consul:
          host: "127.0.0.1:8500"

compactor:
  ring:
    instance_id: "tempo-compactor@${NOMAD_ALLOC_ID}"
    instance_addr: "0.0.0.0"
    # The default value is 60s, but we want to be able to restart Tempo a bit quicker.
    wait_stability_min_duration: "10s"
    kvstore:
      store: "consul"
      prefix: "tempo/compactors/"
      consul:
        host: "127.0.0.1:8500"

storage:
  trace:
    backend: "gcs"

    gcs:
      bucket_name: "<some bucket>"

The compaction summary looks fine to me:

+-----+------------+----------------------------+-------------------------+--------------------------+----------------+----------------+--------------+
| LVL |   BLOCKS   |           TOTAL            |     SMALLEST BLOCK      |      LARGEST BLOCK       |    EARLIEST    |     LATEST     | BLOOM SHARDS |
+-----+------------+----------------------------+-------------------------+--------------------------+----------------+----------------+--------------+
|   0 | 2 (0 %)    | 29,579 objects (158 MB)    | 14,387 objects (77 MB)  | 15,192 objects (81 MB)   | 39m46s ago     | 2m20s ago      |            2 |
|   1 | 8 (2 %)    | 73,684 objects (676 MB)    | 3,599 objects (39 MB)   | 45,877 objects (189 MB)  | 18h35m42s ago  | 30m21s ago     |            8 |
|   2 | 62 (17 %)  | 2,381,881 objects (12 GB)  | 3,708 objects (46 MB)   | 113,031 objects (434 MB) | 337h6m6s ago   | 1h0m21s ago    |           75 |
|   3 | 43 (12 %)  | 1,317,291 objects (8.1 GB) | 7,586 objects (86 MB)   | 104,075 objects (434 MB) | 335h5m44s ago  | 3h0m41s ago    |           49 |
|   4 | 113 (32 %) | 3,560,727 objects (22 GB)  | 8,736 objects (51 MB)   | 106,820 objects (421 MB) | 316h1m37s ago  | 27h4m12s ago   |          179 |
|   5 | 35 (10 %)  | 771,399 objects (5.7 GB)   | 9,206 objects (30 MB)   | 76,451 objects (328 MB)  | 273h10m55s ago | 25h3m51s ago   |           45 |
|   6 | 80 (22 %)  | 2,685,414 objects (17 GB)  | 10,363 objects (134 MB) | 114,190 objects (436 MB) | 259h49m37s ago | 73h46m0s ago   |          112 |
|   8 | 6 (1 %)    | 584,248 objects (2.2 GB)   | 84,923 objects (334 MB) | 124,142 objects (398 MB) | 175h4m37s ago  | 148h52m11s ago |           17 |
+-----+------------+----------------------------+-------------------------+--------------------------+----------------+----------------+--------------+
mapno (Member) commented Sep 18, 2024

Hi! Is it possible that the GCS retention is lower than or equal to Tempo's retention? That could explain it.
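
For reference, Tempo's block retention is controlled by the compactor. A minimal sketch of where it is set (the value below is illustrative, not taken from this deployment):

compactor:
  compaction:
    # Illustrative value; any GCS lifecycle rule should delete objects
    # no sooner than this, or ideally not at all.
    block_retention: 336h

If a bucket lifecycle rule deleted objects before block_retention elapses, the compactor would later fail to find meta.json for blocks it still expects to exist, which would match the 404s above.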

Riscky (Author) commented Sep 18, 2024

Thanks for the suggestion. We have no lifecycle rules on the bucket at the moment (we removed them to check whether they were the issue).

mapno (Member) commented Sep 20, 2024

TBH, I'm a bit lost. I'm now wondering about the address the compactors are advertising in the ring (instance_addr: "0.0.0.0"); that might be messing something up.

I have also seen occurrences where two nodes were trying to compact the same block, so that checks out.

This heavily points in the ring's direction.
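
If the advertised address is indeed the problem, one possible fix (a sketch, not verified on this setup) is to have each compactor advertise a per-node routable address instead of 0.0.0.0, for example via a Nomad-provided IP. ${NOMAD_IP_http} below is illustrative and depends on the job's network/port labels:

compactor:
  ring:
    instance_id: "tempo-compactor@${NOMAD_ALLOC_ID}"
    # Advertise an address that other ring members can actually reach;
    # ${NOMAD_IP_http} is illustrative and depends on the Nomad port label in use.
    instance_addr: "${NOMAD_IP_http}"

With instance_addr: "0.0.0.0", all three instances advertise the same non-routable address, so advertising distinct, reachable addresses at least rules that ambiguity out.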
