Tempo Compactions Failing, CloudFlare R2 Bucket Always Increasing, Compactor throwing error: "error completing block: error completing multipart upload" #4099
Comments
This seems to be an incompatibility between R2 and S3. Here is a similar discussion on Cloudflare's forums: https://community.cloudflare.com/t/all-non-trailing-parts-must-have-the-same-length/552190

It seems that when pushing a multipart upload to R2, all parts must have the same length except for the final one. When uploading a block, Tempo currently flushes one row group at a time, and row groups have a variable number of bytes. This is not easily corrected in Tempo. Unless you find some very clean way to resolve this in Tempo, we would likely not take a PR to "fix" this issue. To use Tempo with R2, we will likely need R2 to become S3-compatible on this point.
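To make the constraint concrete, here is a minimal, self-contained Go sketch (the part sizes are made up for illustration) of the invariant R2 enforces and that Tempo's per-row-group flushing violates:

```go
package main

import "fmt"

// Minimal sketch (hypothetical sizes): R2 requires every non-trailing part
// of a multipart upload to have the same length, while Tempo flushes one
// Parquet row group per part, and row group sizes vary.
func main() {
	// Bytes flushed per row group; only the last part may differ in length.
	parts := []int{6_291_456, 9_437_184, 3_145_728}

	for i, size := range parts {
		trailing := i == len(parts)-1
		if !trailing && size != parts[0] {
			// This is the condition R2 rejects when completing the
			// multipart upload, which Tempo surfaces as
			// "error completing multipart upload".
			fmt.Printf("part %d is %d bytes, part 1 is %d bytes: R2 rejects the upload\n",
				i+1, size, parts[0])
		}
	}
}
```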
Thanks for your reply @joe-elliott. Yes, I ran into that post too, and a few others, but I figured maybe there is a way to configure Tempo to send chunks of the same length. Wouldn't setting …

Another interesting thing to note is that some compaction cycles do end up completing successfully; not all of them fail with the error above.
I wonder if there are some "sweet spot" compactor settings that would get me fewer failed compaction flushes. Maybe setting a low …
Technically you can raise your row group size so that compactors/ingesters always flush Parquet files with exactly one row group, but this will not work for anything except the smallest Tempo installs.
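A minimal sketch of that knob, assuming the `parquet_row_group_size_bytes` name from Tempo's block storage config (verify the exact parameter name for your Tempo version):

```yaml
storage:
  trace:
    block:
      # Raise this to at least the target block size so every flushed
      # Parquet file contains exactly one row group (and the multipart
      # upload therefore has only a single, trailing part).
      parquet_row_group_size_bytes: 100000000
```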
You are probably restricting the compactors to creating only very small blocks. Those blocks all have one row group, so the multipart upload succeeds.
Describe the bug
About a week ago, our Tempo alerts for compaction failures started firing.
This is the query that we use. If it's above 1, the alerts fire.
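The expression itself isn't shown here; a sketch of a typical alert of this shape, assuming Tempo's standard `tempodb_compaction_errors_total` metric rather than the exact query used in this install:

```promql
sum(increase(tempodb_compaction_errors_total{}[1h]))
```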
I checked the compactor logs (we only have one compactor running), and the only suspicious thing I noticed was messages like this:
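Going by the error quoted in the issue title, the log lines presumably look something like the following (the surrounding fields are an assumption, not a verbatim log):

```
level=error msg="error during compaction cycle" err="error completing block: error completing multipart upload"
```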
On top of that, I also noticed that the R2 bucket we use for traces is ever-increasing in size, which didn't happen before the compaction errors appeared.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
I expect that there would be no compaction errors.
Environment:
Additional Context
I've looked at this runbook, and have tried increasing the memory of the compactor pod significantly and adjusting `compaction_window` to 30min. I've also tried adjusting `max_block_bytes` to 5G, but the problem still persists (see the config sketch at the end of this section). Some other issues that I've come across are: #3529, #1774.
After looking at those issues, I've since adjusted my bucket policies to delete objects one day after my Tempo block retention period, per @joe-elliott's suggestion here.
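For reference, a minimal sketch of where those compactor settings live, assuming a standard Tempo YAML config (the values shown are the ones tried above):

```yaml
compactor:
  compaction:
    # Window of time in which blocks are grouped together for compaction.
    compaction_window: 30m
    # Upper bound on the size of a compacted block; ~5 GiB here.
    max_block_bytes: 5368709120
```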