
Compactor failures #4091

Open
Riscky opened this issue Sep 17, 2024 · 3 comments

Comments


Riscky commented Sep 17, 2024

Describe the bug

We have set up a small Tempo cluster, and it’s running mostly fine.
However, we are getting compactor errors almost every compaction cycle:

googleapi: Error 404: No such object: /single-tenant/<block_id>/meta.json, notFound

The error message (msg field) is sometimes slightly different:

  • failed to mark block compacted during retention
  • unable to mark block compacted
  • failed to clear compacted block during retention

but the inner error (err field) is always the above.

Based on what we can find about these errors (#2560, #1270, https://community.grafana.com/t/cannot-find-traceids-in-s3-blocks/45423/10), it would appear that the compactor ring is not (correctly) formed.
I have also seen occurrences where two nodes were trying to compact the same block, so that checks out.
However, the ring status page looks fine (all three nodes show as active).
The ingester ring has formed and we haven't seen any problems with that.

To Reproduce

I haven't been able to reproduce this in a different environment yet.

We're using Tempo 2.5.0, with Consul as the KV store and GCS as block storage. We're running 3 nodes in scalable-single-binary mode on a Nomad cluster.

Expected behavior

No compaction errors

Environment:

  • Infrastructure: Nomad cluster

Additional Context

Our configuration, insofar as it seems relevant to the issue:

# We run Grafana Tempo in ScalableSingleBinary mode. 
target: "scalable-single-binary"

ingester:
  lifecycler:
    id: "tempo-ingester@${NOMAD_ALLOC_ID}"
    ring:
      kvstore:
        store: "consul"
        prefix: "tempo/ingesters/"
        consul:
          host: "127.0.0.1:8500"

compactor:
  ring:
    instance_id: "tempo-compactor@${NOMAD_ALLOC_ID}"
    instance_addr: "0.0.0.0"
    # The default value is 60s, but we want to be able to restart Tempo a bit quicker.
    wait_stability_min_duration: "10s"
    kvstore:
      store: "consul"
      prefix: "tempo/compactors/"
      consul:
        host: "127.0.0.1:8500"

storage:
  trace:
    backend: "gcs"

    gcs:
      bucket_name: "<some bucket>"

The compaction summary looks fine to me:

+-----+------------+----------------------------+-------------------------+--------------------------+----------------+----------------+--------------+
| LVL |   BLOCKS   |           TOTAL            |     SMALLEST BLOCK      |      LARGEST BLOCK       |    EARLIEST    |     LATEST     | BLOOM SHARDS |
+-----+------------+----------------------------+-------------------------+--------------------------+----------------+----------------+--------------+
|   0 | 2 (0 %)    | 29,579 objects (158 MB)    | 14,387 objects (77 MB)  | 15,192 objects (81 MB)   | 39m46s ago     | 2m20s ago      |            2 |
|   1 | 8 (2 %)    | 73,684 objects (676 MB)    | 3,599 objects (39 MB)   | 45,877 objects (189 MB)  | 18h35m42s ago  | 30m21s ago     |            8 |
|   2 | 62 (17 %)  | 2,381,881 objects (12 GB)  | 3,708 objects (46 MB)   | 113,031 objects (434 MB) | 337h6m6s ago   | 1h0m21s ago    |           75 |
|   3 | 43 (12 %)  | 1,317,291 objects (8.1 GB) | 7,586 objects (86 MB)   | 104,075 objects (434 MB) | 335h5m44s ago  | 3h0m41s ago    |           49 |
|   4 | 113 (32 %) | 3,560,727 objects (22 GB)  | 8,736 objects (51 MB)   | 106,820 objects (421 MB) | 316h1m37s ago  | 27h4m12s ago   |          179 |
|   5 | 35 (10 %)  | 771,399 objects (5.7 GB)   | 9,206 objects (30 MB)   | 76,451 objects (328 MB)  | 273h10m55s ago | 25h3m51s ago   |           45 |
|   6 | 80 (22 %)  | 2,685,414 objects (17 GB)  | 10,363 objects (134 MB) | 114,190 objects (436 MB) | 259h49m37s ago | 73h46m0s ago   |          112 |
|   8 | 6 (1 %)    | 584,248 objects (2.2 GB)   | 84,923 objects (334 MB) | 124,142 objects (398 MB) | 175h4m37s ago  | 148h52m11s ago |           17 |
+-----+------------+----------------------------+-------------------------+--------------------------+----------------+----------------+--------------+
mapno (Member) commented Sep 18, 2024

Hi! Is it possible that the GCS retention is lower than or equal to Tempo's retention? That could explain it.
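
For reference, Tempo's block retention is controlled by the compactor. A minimal sketch of where it is set (the value below is illustrative, not taken from this deployment):

compactor:
  compaction:
    # Illustrative value; any GCS lifecycle rule should delete objects
    # no sooner than this, or ideally not at all.
    block_retention: 336h

If a bucket lifecycle rule deleted objects before block_retention elapses, the compactor would later fail to find meta.json for blocks it still expects to exist, which would match the 404s above.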

Riscky (Author) commented Sep 18, 2024

Thanks for the suggestion. We have no lifecycle rules on the bucket at the moment (we removed them to check whether they were the issue).

mapno (Member) commented Sep 20, 2024

TBH, I'm a bit lost. I'm now wondering about the address the compactors are advertising in the ring (instance_addr: "0.0.0.0"); that might be messing something up.

I have also seen occurrences where two nodes were trying to compact the same block, so that checks out.

This heavily points in the ring's direction.
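
If the advertised address is indeed the problem, one possible fix (a sketch, not verified on this setup) is to have each compactor advertise a per-node routable address instead of 0.0.0.0, for example via a Nomad-provided IP. ${NOMAD_IP_http} below is illustrative and depends on the job's network/port labels:

compactor:
  ring:
    instance_id: "tempo-compactor@${NOMAD_ALLOC_ID}"
    # Advertise an address that other ring members can actually reach;
    # ${NOMAD_IP_http} is illustrative and depends on the Nomad port label in use.
    instance_addr: "${NOMAD_IP_http}"

With instance_addr: "0.0.0.0", all three instances advertise the same non-routable address, so advertising distinct, reachable addresses at least rules that ambiguity out.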
