Tempo Compactions Failing, CloudFlare R2 Bucket Always Increasing, Compactor throwing error: "error completing block: error completing multipart upload" #4099

Open
alextricity25 opened this issue Sep 19, 2024 · 5 comments

@alextricity25

Describe the bug
About a week ago, our Tempo alerts for compaction failures started firing.

This is the query we use; if it returns a value above 1, the alert fires.

sum by (cluster, namespace) (increase(tempodb_compaction_errors_total{}[1h]))
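For context, here's a minimal sketch of the alert rule built around that query; the rule name, for duration, and labels are illustrative rather than our exact rule:

groups:
  - name: tempo-compactor
    rules:
      - alert: TempoCompactionsFailing
        # Fires when compaction errors were recorded over the last hour.
        expr: |
          sum by (cluster, namespace) (increase(tempodb_compaction_errors_total{}[1h])) > 1
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Tempo compactions are failing in {{ $labels.cluster }}/{{ $labels.namespace }}"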

[screenshot: the compaction error query spiking above the alert threshold]

I checked the compactor logs (we only have one compactor running), and the only suspicious thing I've noticed is messages like these:

level=error ts=2024-09-17T16:39:39.029886052Z caller=compactor.go:162 msg="error during compaction cycle" err="error shipping block to backend, blockID da530417-5510-4f22-996f-c35336eea572: error completing block: error completing multipart upload, object: single-tenant/da530417-5510-4f22-996f-c35336eea572/data.parquet, obj etag: : All non-trailing parts must have the same length."
level=error ts=2024-09-17T16:41:23.423707238Z caller=compactor.go:162 msg="error during compaction cycle" err="error shipping block to backend, blockID 5b745471-a889-40e9-8152-006625469f6c: error completing block: error completing multipart upload, object: single-tenant/5b745471-a889-40e9-8152-006625469f6c/data.parquet, obj etag: : All non-trailing parts must have the same length."
level=error ts=2024-09-17T16:43:08.882899458Z caller=compactor.go:162 msg="error during compaction cycle" err="error shipping block to backend, blockID 055d0b14-7789-492f-99ca-bd0b74aa1f2c: error completing block: error completing multipart upload, object: single-tenant/055d0b14-7789-492f-99ca-bd0b74aa1f2c/data.parquet, obj etag: : All non-trailing parts must have the same length."

On top of that, I also noticed that the R2 bucket we use for traces keeps growing in size, which didn't happen before the compaction errors started.
[screenshot: the R2 bucket size steadily increasing]

To Reproduce
Steps to reproduce the behavior:

  1. Start Tempo 2.6.0
  2. Configure Tempo to use a Cloudflare R2 bucket as the S3 backend
  3. Perform Operations (Read/Write/Others)
  4. Observe compaction errors

Expected behavior
I expect compaction cycles to complete without errors.

Environment:

  • Infrastructure: Kubernetes GKE
  • Deployment tool: helm tempo-distributed, values below:
compactor:
  autoscaling:
    enabled: true
    keda:
      enabled: true
      triggers:
      - metadata:
          customHeaders:
            X-Scope-OrgID: metamonitoring
          query: sum by (cluster, namespace, tenant) (tempodb_compaction_outstanding_blocks{container="compactor",
            namespace=~".*"}) / ignoring(tenant) group_left count by (cluster, namespace)(tempo_build_info{container="compactor",
            namespace=~".*"})
          serverAddress: http://mimir-distributed-nginx.mimir.svc.cluster.local/prometheus
          threshold: "250"
        type: prometheus
  config:
    compaction:
      block_retention: 336h
  extraEnv:
  - name: JAEGER_ENDPOINT
    value: http://tempo-gateway.tempo.svc.cluster.local:80/jaeger/api/traces
  - name: JAEGER_SAMPLER_TYPE
    value: const
  - name: JAEGAR_SAMPLER_PARAM
    value: "1"
  resources:
    requests:
      cpu: "2"
      memory: 4Gi
distributor:
  autoscaling:
    enabled: true
    targetCPUUtilizationPercentage: 60
  config:
    log_received_spans:
      enabled: true
  extraEnv:
  - name: JAEGER_ENDPOINT
    value: http://tempo-gateway.tempo.svc.cluster.local:80/jaeger/api/traces
  - name: JAEGER_SAMPLER_TYPE
    value: const
  - name: JAEGAR_SAMPLER_PARAM
    value: "1"
  resources:
    requests:
      cpu: "1"
      memory: 256Mi
gateway:
  autoscaling:
    enabled: true
    maxReplicas: 10
    minReplicas: 1
    targetCPUUtilizationPercentage: 60
  enabled: true
  extraEnv:
  - name: JAEGER_ENDPOINT
    value: http://tempo-gateway.tempo.svc.cluster.local:80/jaeger/api/traces
  - name: JAEGER_SAMPLER_TYPE
    value: const
  - name: JAEGAR_SAMPLER_PARAM
    value: "1"
  ingress:
    annotations:
      cert-manager.io/cluster-issuer: issuer-9c66eaea
      nginx.ingress.kubernetes.io/auth-realm: Authentication Required - loki
      nginx.ingress.kubernetes.io/auth-secret: tempo-basic-auth
      nginx.ingress.kubernetes.io/auth-type: basic
    enabled: true
    hosts:
    - host: tempo.xrdm.dev
      paths:
      - path: /
        pathType: Prefix
    ingressClassName: nginx
    tls:
    - hosts:
      - tempo.xrdm.dev
      secretName: tempo-tls-cert
  resources:
    requests:
      cpu: 100m
      memory: 256Mi
global_overrides:
  defaults:
    metrics_generator:
      processors:
      - service-graphs
      - span-metrics
ingester:
  autoscaling:
    enabled: true
    targetCPUUtilizationPercentage: 60
  extraEnv:
  - name: JAEGER_ENDPOINT
    value: http://tempo-gateway.tempo.svc.cluster.local:80/jaeger/api/traces
  - name: JAEGER_SAMPLER_TYPE
    value: const
  - name: JAEGAR_SAMPLER_PARAM
    value: "1"
  resources:
    requests:
      cpu: "1"
      memory: 256Mi
metaMonitoring:
  grafanaAgent:
    enabled: true
    installOperator: false
    logs:
      remote:
        url: http://loki-gateway.loki.svc.cluster.local/loki/api/v1/push
    metrics:
      remote:
        headers:
          X-Scope-OrgID: metamonitoring
        url: http://mimir-distributed-nginx.mimir.svc.cluster.local:80/api/v1/push
  serviceMonitor:
    enabled: true
metricsGenerator:
  config:
    metrics_ingestion_time_range_slack: 10m
    processor:
      service_graphs:
        dimensions:
        - xrdm_deployment
      span_metrics:
        dimensions:
        - xrdm_deployment
    storage:
      remote_write:
      - send_exemplars: true
        url: http://mimir-distributed-nginx.mimir.svc.cluster.local:80/api/v1/push
  enabled: true
querier:
  autoscaling:
    enabled: true
    targetCPUUtilizationPercentage: 60
  extraEnv:
  - name: JAEGER_ENDPOINT
    value: http://tempo-gateway.tempo.svc.cluster.local:80/jaeger/api/traces
  - name: JAEGER_SAMPLER_TYPE
    value: const
  - name: JAEGAR_SAMPLER_PARAM
    value: "1"
  resources:
    requests:
      cpu: "1"
      memory: 256Mi
queryFrontend:
  autoscaling:
    enabled: true
    targetCPUUtilizationPercentage: 60
  extraEnv:
  - name: JAEGER_ENDPOINT
    value: http://tempo-gateway.tempo.svc.cluster.local:80/jaeger/api/traces
  - name: JAEGER_SAMPLER_TYPE
    value: const
  - name: JAEGAR_SAMPLER_PARAM
    value: "1"
  resources:
    requests:
      cpu: "1"
      memory: 256Mi
storage:
  trace:
    backend: s3
    block:
      parquet_row_group_size_bytes: 300MB
    s3:
      access_key: xxx
      bucket: telemetry-traces
      endpoint: xxx
      region: us-east-1
      secret_key: xxx
traces:
  otlp:
    grpc:
      enabled: true
    http:
      enabled: true
usage_report:
  reporting_enabled: false

Additional Context

I've looked at this runbook, and have tried increasing the memory of the compactor pod significantly and adjusting compaction_window to 30min. I've also tried adjusting max_block_bytes to 5G, but the problem still persists (a sketch of where those overrides sit in the values is at the end of this comment). Some other issues that I've come across are:

#3529
#1774

After looking at those issues, I've adjusted my bucket policies to delete objects one day after my Tempo block retention period, per @joe-elliott's suggestion here.
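For reference, the compaction overrides mentioned above sit under compactor.config.compaction in the tempo-distributed values. Roughly what I tried (block_retention is unchanged from the values above; the max_block_bytes figure is the 5G value expressed in bytes):

compactor:
  config:
    compaction:
      block_retention: 336h        # unchanged from the values above
      compaction_window: 30m       # reduced from the 1h default
      max_block_bytes: 5368709120  # ~5G, expressed in bytes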

@joe-elliott
Member

This seems to be an incompatibility between R2 and S3. Here is a similar discussion on Cloudflare's forums:

https://community.cloudflare.com/t/all-non-trailing-parts-must-have-the-same-length/552190

It seems that when pushing a multipart upload to R2, all parts must have the same length except for the final one. When uploading a block, Tempo currently flushes one row group at a time, and each row group has a variable number of bytes. This is not easily corrected in Tempo.

Unless you find a very clean way to resolve this in Tempo, we would likely not take a PR to "fix" this issue. To use Tempo with R2, we will likely need R2 to be S3-compatible.

@alextricity25
Author

alextricity25 commented Sep 20, 2024

Thanks for your reply @joe-elliott. Yes, I ran into that post too, and a few others, but I figured there might be a way to configure Tempo to send parts of the same length. Wouldn't setting parquet_row_group_size_bytes effectively do this?

Another interesting thing to note is that some compaction cycles do end up completing successfully. Not all of them fail with the "All non-trailing parts must have the same length" error. For example, here are some more logs from the compactor, which show that blocks are indeed being flushed successfully (does "flushed" in this context mean being sent to the backend storage bucket via the S3 API?).

level=info ts=2024-09-20T13:49:05.555318241Z caller=compactor.go:186 msg="beginning compaction" traceID=42f3b0ba265a0113
level=info ts=2024-09-20T13:49:05.555355933Z caller=compactor.go:198 msg="compacting block" block="&{Version:vParquet4 BlockID:ab121e03-8931-49e0-b144-11e908cf6d37 TenantID:single-tenant StartTime:2024-09-16 16:48:10 +0000 UTC EndTime:2024-09-16 17:19:30 +0000 UTC TotalObjects:40030 Size:197946447 CompactionLevel:0 Encoding:none IndexPageSize:0 TotalRecords:3 DataEncoding: BloomShardCount:1 FooterSize:50483 DedicatedColumns:[] ReplicationFactor:0}"
level=info ts=2024-09-20T13:49:05.653482732Z caller=compactor.go:198 msg="compacting block" block="&{Version:vParquet4 BlockID:8a64f55c-362e-4e10-bba4-f8832e6b2d36 TenantID:single-tenant StartTime:2024-09-16 16:55:20 +0000 UTC EndTime:2024-09-16 17:26:35 +0000 UTC TotalObjects:40253 Size:197983074 CompactionLevel:0 Encoding:none IndexPageSize:0 TotalRecords:3 DataEncoding: BloomShardCount:1 FooterSize:50483 DedicatedColumns:[] ReplicationFactor:0}"
level=info ts=2024-09-20T13:49:05.791970019Z caller=compactor.go:198 msg="compacting block" block="&{Version:vParquet4 BlockID:69f50bc3-df70-4256-a622-5d3297170408 TenantID:single-tenant StartTime:2024-09-16 17:18:04 +0000 UTC EndTime:2024-09-16 17:56:45 +0000 UTC TotalObjects:4927 Size:3881037 CompactionLevel:1 Encoding:none IndexPageSize:0 TotalRecords:1 DataEncoding: BloomShardCount:1 FooterSize:18783 DedicatedColumns:[] ReplicationFactor:0}"
level=info ts=2024-09-20T13:49:53.774626347Z caller=compactor.go:250 msg="flushed to block" bytes=68231216 objects=14956 values=99996196
level=info ts=2024-09-20T13:50:37.246234046Z caller=compactor.go:250 msg="flushed to block" bytes=68339859 objects=15104 values=99993405
level=error ts=2024-09-20T13:51:19.25979008Z caller=compactor.go:162 msg="error during compaction cycle" err="error shipping block to backend, blockID 607efda4-c183-40b7-89e5-90c19ca6c463: error completing block: error completing multipart upload, object: single-tenant/607efda4-c183-40b7-89e5-90c19ca6c463/data.parquet, obj etag: : All non-trailing parts must have the same length."
level=info ts=2024-09-20T13:51:19.2598735Z caller=compactor.go:155 msg="Compacting hash" hashString=single-tenant-479584-0
level=info ts=2024-09-20T13:51:19.259919754Z caller=compactor.go:186 msg="beginning compaction" traceID=36d004b60c5e7aa1
level=info ts=2024-09-20T13:51:19.259962307Z caller=compactor.go:198 msg="compacting block" block="&{Version:vParquet4 BlockID:c0c857ff-d006-4f33-ad48-ea7a7bbe1031 TenantID:single-tenant StartTime:2024-09-16 16:25:10 +0000 UTC EndTime:2024-09-16 16:56:25 +0000 UTC TotalObjects:43818 Size:218888946 CompactionLevel:0 Encoding:none IndexPageSize:0 TotalRecords:4 DataEncoding: BloomShardCount:1 FooterSize:66804 DedicatedColumns:[] ReplicationFactor:0}"
level=info ts=2024-09-20T13:51:19.395246194Z caller=compactor.go:198 msg="compacting block" block="&{Version:vParquet4 BlockID:90f85458-f556-49b2-b78c-41ace5cc89fc TenantID:single-tenant StartTime:2024-09-16 16:17:51 +0000 UTC EndTime:2024-09-16 16:49:20 +0000 UTC TotalObjects:45199 Size:224820188 CompactionLevel:0 Encoding:none IndexPageSize:0 TotalRecords:4 DataEncoding: BloomShardCount:1 FooterSize:67197 DedicatedColumns:[] ReplicationFactor:0}"
level=info ts=2024-09-20T13:51:19.539027714Z caller=compactor.go:198 msg="compacting block" block="&{Version:vParquet4 BlockID:d36b0a24-644a-4698-9588-07db2b0f9188 TenantID:single-tenant StartTime:2024-09-16 15:47:40 +0000 UTC EndTime:2024-09-16 16:26:15 +0000 UTC TotalObjects:10961 Size:11425383 CompactionLevel:1 Encoding:none IndexPageSize:0 TotalRecords:1 DataEncoding: BloomShardCount:1 FooterSize:19148 DedicatedColumns:[] ReplicationFactor:0}"
level=info ts=2024-09-20T13:52:06.777694147Z caller=compactor.go:250 msg="flushed to block" bytes=68436112 objects=16185 values=99991189

I wonder if there are some "sweet spot" settings for the compactor that would get me fewer failed compaction flushes. Maybe setting a low max_compaction_objects or max_block_bytes would ensure that the multipart upload sends parts of the same length? I'll play around more with these settings.

@joe-elliott
Member

parquet_row_group_size_bytes is a target value. There's no way I'm aware of to enforce an exact row group size in Parquet. Each row is added atomically and can't be split to force the row group to hit an exact byte count.

I wonder if there are some "sweet spot" settings for the compactor that would get me fewer failed compaction flushes.

Technically you can raise your row group size so that compactors/ingesters always flush Parquet files with exactly one row group, but this will not work for anything except the smallest Tempo installs.
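If someone did want to try that on a small install, it would just be a matter of raising the existing target in the values, following the 300MB style already used above. For example (the 5GB figure is purely illustrative; pick something larger than your biggest compacted block):

storage:
  trace:
    block:
      # Target larger than the biggest data.parquet the compactor produces,
      # so each block ends up with a single row group. Small installs only.
      parquet_row_group_size_bytes: 5GB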

@alextricity25
Author

Got it. Interestingly enough, I set max_compaction_objects to 10000, and that seems to be moving things in the right direction...
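For anyone else landing here, that override sits alongside the other compaction options in the values; roughly:

compactor:
  config:
    compaction:
      block_retention: 336h          # unchanged
      max_compaction_objects: 10000  # cap on traces per compacted block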

[screenshots: the compaction error and bucket size graphs from above trending back in the right direction after the change]

@joe-elliott
Member

Got it. Interestingly enough, I set max_compaction_objects to 10000, and that seems to be moving things in the right direction...

You are probably restricting the compactors to creating only very small blocks. These blocks all have one row group, so the multipart upload succeeds.
