Failures in test_scrubber_physical_gc #8928

Open
jcsp opened this issue Sep 5, 2024 · 2 comments · May be fixed by #9045
Labels: a/test (Area: related to testing), c/storage/pageserver (Component: storage: pageserver), t/bug (Issue Type: Bug)

Comments

@jcsp
Contributor

jcsp commented Sep 5, 2024

Failing since 4th Sep:

https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8681/10718968862/index.html#testresult/9177f6b50b1cbc31/retries

@jcsp jcsp added a/test Area: related to testing c/storage/pageserver Component: storage: pageserver t/bug Issue Type: Bug labels Sep 5, 2024
@jcsp jcsp changed the title Failures test_scrubber_physical_gc Failures in test_scrubber_physical_gc Sep 5, 2024
@jcsp
Contributor Author

jcsp commented Sep 16, 2024

This was an issue with how the storage controller handles a Detached tenant, which is something we don't currently do in the field.

@VladLazar
Contributor

The test does the following in a loop:

  • detach all 4 shards of one tenant
  • wait out reconciles
  • AttachSingle(0) all 4 shards of the tenant

When we detach, we don't update the compute hook state, so it still points to the detached pageservers.
When the first shard finishes its location config, it goes ahead and reconfigures the compute with a mixture
of new state (for the shard that was just reconciled) and old state (for the others). The compute tries to prefetch
something as part of reconfiguration, and we get a deadlock. In prod, the cplane database acts as a buffer that masks this eventual consistency.
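
The stale mixture can be illustrated with a toy model. This is only a sketch, not the storage controller's code: `ComputeHookState`, `on_attach`, `on_detach`, and the `ps-old-*` / `ps-new-*` pageserver names are all hypothetical. The one behavior it models is that detach leaves the hook's shard-to-pageserver map untouched.

```python
from dataclasses import dataclass, field


@dataclass
class ComputeHookState:
    """Toy stand-in for the compute hook's view of one tenant (hypothetical)."""

    # shard number -> pageserver the compute should talk to
    locations: dict[int, str] = field(default_factory=dict)

    def on_attach(self, shard: int, pageserver: str) -> dict[int, str]:
        """Record the new location and return the notification payload."""
        self.locations[shard] = pageserver
        return dict(self.locations)

    def on_detach(self, shard: int) -> None:
        """Behavior described in the issue: detach does not touch hook state."""
        # Intentionally a no-op, so the old entry stays behind.


hook = ComputeHookState()

# Initial placement: all 4 shards attached.
for shard in range(4):
    hook.on_attach(shard, f"ps-old-{shard}")

# Detach every shard; the hook still remembers the old pageservers.
for shard in range(4):
    hook.on_detach(shard)

# The first shard to finish re-attaching triggers a notification.
payload = hook.on_attach(0, "ps-new-0")
stale = {s: ps for s, ps in payload.items() if ps.startswith("ps-old-")}
print(payload)
print("stale shards:", sorted(stale))
```

In this model the payload for shard 0 is fresh while shards 1 to 3 still name pageservers that were detached, which is the mixed state the compute ends up acting on.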

We can fix this by updating the compute hook state on detach.
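
A minimal sketch of that fix, again as a toy model with hypothetical names (`ComputeHook`, `on_attach`, `on_detach`). It additionally assumes that a notification is held back until every shard has a known location; that gating is an assumption made for the sketch, not something stated in this thread.

```python
from typing import Optional


class ComputeHook:
    """Toy model of per-tenant hook state (hypothetical names throughout)."""

    def __init__(self, shard_count: int) -> None:
        self.shard_count = shard_count
        self.locations: dict[int, str] = {}

    def on_detach(self, shard: int) -> None:
        # The fix: forget the location so it cannot leak into a later payload.
        self.locations.pop(shard, None)

    def on_attach(self, shard: int, pageserver: str) -> Optional[dict[int, str]]:
        self.locations[shard] = pageserver
        # Assumption: hold the notification until every shard has a location.
        if len(self.locations) < self.shard_count:
            return None
        return dict(self.locations)


hook = ComputeHook(shard_count=4)

# Initial placement, then detach all shards (state is cleared this time).
for shard in range(4):
    hook.on_attach(shard, f"ps-old-{shard}")
for shard in range(4):
    hook.on_detach(shard)

# Re-attach one shard at a time and collect what would be sent.
notifications = [hook.on_attach(s, f"ps-new-{s}") for s in range(4)]
print(notifications)
```

With the state cleared on detach, the first three re-attaches produce no payload in this model, and the fourth produces one that names only the new pageservers.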

An interesting question that arises from this is: "Should we notify cplane about detaches?" Doing so complicates the interaction
between services, but ensures that a compute can't send requests with a stale pageserver list.

VladLazar added a commit that referenced this issue Sep 18, 2024
Problem

Previously, the storage controller could send compute notifications
containing stale pageservers (i.e. the pageserver serving the shard had
been detached). This happened because detaches did not update the compute
hook state.

Summary of Changes

Update compute hook state on shard detach.

Fixes #8928
@VladLazar VladLazar linked a pull request Sep 18, 2024 that will close this issue