Storage controller proxies requests by intent leading to unavailability #9062

Open
VladLazar opened this issue Sep 19, 2024 · 0 comments · May be fixed by #9065
Labels
c/storage/controller Component: Storage Controller c/storage Component: storage t/bug Issue Type: Bug

I looked at the cloudbench run on 2024-09-18 (link).

Timeline

It failed to create a few branches:

2024-09-18T22:06:38.858Z
ERROR
cloudbench	creating branch for project failed: decode response: error: code 500: {Code: Message:unknown error}	
{"unit": 5252, "project_id": "wild-frog-19741510", "times": 6, 
 "error": "creating branch for project failed: decode response: error: code 500: {Code: Message:unknown error}"}

The control plane proxied the request to the storage controller but got a 404 error (logs):

{"level":"ERR","ts":"2024-09-18T22:05:45.992Z","logger":"publicapiv2","message":"incoming request finished with error: internal: UNKNOWN: could not create project-branch: pageserver error","http_meth":"POST","http_path":"/api/v2/projects/wild-frog-19741510/branches","route":"CreateProjectBranch","request_id":"b3be7202-08c8-4623-b113-23383b397795","trace_id":"QDXxW2SC4esgx4jLzLQgsp","project_id":"wild-frog-19741510","account_id":"3eeaaef0-50fa-4074-8ed7-0a20f097d9fb","ingress_duration_ms":693,"status":500,"account_id":"3eeaaef0-50fa-4074-8ed7-0a20f097d9fb","status":404,"message":"NotFound: tenant fa8e211cd9784317f0143c713e3cbb09","error":"incoming request finished with error: internal: UNKNOWN: could not create project-branch: pageserver error"}

The storage controller received the request and proxied it to a pageserver, but got a 404 error back (logs):

2024-09-18T22:05:45.471294Z  INFO request{method=GET path=/v1/tenant/fa8e211cd9784317f0143c713e3cbb09/timeline/4a0c742862f8ec32f703684a4f7f3e97 request_id=b3be7202-08c8-4623-b113-23383b397795}: Proxying request for tenant fa8e211cd9784317f0143c713e3cbb09 (/v1/tenant/fa8e211cd9784317f0143c713e3cbb09/timeline/4a0c742862f8ec32f703684a4f7f3e97)
	
2024-09-18T22:05:45.991205Z  INFO request{method=GET path=/v1/tenant/fa8e211cd9784317f0143c713e3cbb09/timeline/4a0c742862f8ec32f703684a4f7f3e97 request_id=b3be7202-08c8-4623-b113-23383b397795}: Request handled, status: 404 Not Found

The pageserver received the request but did not have the tenant attached (logs):

2024-09-18T22:05:45.487653Z  INFO request{method=GET path=/v1/tenant/fa8e211cd9784317f0143c713e3cbb09/timeline/4a0c742862f8ec32f703684a4f7f3e97 request_id=17f9d525-64e1-4244-b743-362706975271}: Error processing HTTP request: NotFound: tenant fa8e211cd9784317f0143c713e3cbb09

What happened?

Pageserver holding shard 0 for tenant fa8e211cd9784317f0143c713e3cbb09 was briefly marked as offline:

2024-09-18T22:05:44.162464Z  INFO spawn_heartbeat_driver: Node 9355 transition to offline
...
2024-09-18T22:05:52.695691Z  INFO spawn_heartbeat_driver: Node 9355 transition to active
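
As a quick check on the timeline, the snippet below uses only timestamps copied from the log lines in this issue to compute how long node 9355 was marked offline and whether the storage controller proxied the failing request inside that window. The helper name `parse` is illustrative.

```python
from datetime import datetime

def parse(ts: str) -> datetime:
    # Log timestamps are UTC with a trailing "Z" and microsecond precision.
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ")

offline = parse("2024-09-18T22:05:44.162464Z")  # Node 9355 transition to offline
active = parse("2024-09-18T22:05:52.695691Z")   # Node 9355 transition to active
proxied = parse("2024-09-18T22:05:45.471294Z")  # storage controller proxies the request

offline_window_s = (active - offline).total_seconds()
in_window = offline <= proxied <= active
print(f"offline for {offline_window_s:.3f}s; proxied inside window: {in_window}")
```

The node was considered offline for about 8.5 seconds, and the request was proxied roughly 1.3 seconds into that window.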

In response, the intent state for shard 0 of tenant fa8e211cd9784317f0143c713e3cbb09 was updated and reconciles were triggered. The reconcile for shard 0 of that tenant got stuck waiting on the semaphore:

2024-09-18T22:05:44.179449Z  INFO spawn_heartbeat_driver: Concurrency limited: enqueued for reconcile later tenant_id=fa8e211cd9784317f0143c713e3cbb09 shard_id=0000
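
The "Concurrency limited: enqueued for reconcile later" message suggests a pattern where a reconcile that cannot get a permit is deferred rather than started. The following is a minimal sketch of that pattern under stated assumptions, not the storage controller's implementation: `ReconcileScheduler`, `maybe_reconcile`, the limit of 1, and the `other-tenant` shard are all hypothetical.

```python
import threading
from collections import deque

class ReconcileScheduler:
    """Hypothetical stand-in for a reconcile concurrency limit."""

    def __init__(self, max_concurrent: int) -> None:
        self._permits = threading.Semaphore(max_concurrent)
        self.delayed = deque()  # shards whose reconcile has not started yet

    def maybe_reconcile(self, shard: str) -> bool:
        # Non-blocking acquire: if no permit is free, defer instead of waiting.
        if self._permits.acquire(blocking=False):
            return True  # caller runs the reconcile, then calls done()
        self.delayed.append(shard)
        return False

    def done(self) -> None:
        # Give the permit back once a reconcile finishes.
        self._permits.release()

sched = ReconcileScheduler(max_concurrent=1)  # limit of 1 is an assumption
started_other = sched.maybe_reconcile("other-tenant/0000")
started_ours = sched.maybe_reconcile("fa8e211cd9784317f0143c713e3cbb09/0000")
print(started_other, started_ours, list(sched.delayed))
```

While a shard sits in the delayed queue, its intent state has already changed but nothing has been applied to any pageserver.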

When proxying requests to pageservers, we use the intent state and hope that it matches reality (code).

That was not the case here: we had changed the intent state in response to the node going offline, but the reconcile for that shard was still queued, so we proxied the request to the wrong pageserver.
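
The mismatch between intent and reality can be illustrated with a small model. This is a sketch, not the storage controller's code: `ShardState`, `proxy`, the `attached` map, and node ID 9999 are hypothetical. Only node 9355 and the tenant ID come from the logs above.

```python
from dataclasses import dataclass
from typing import Dict, Set

TENANT = "fa8e211cd9784317f0143c713e3cbb09"

@dataclass
class ShardState:
    intent_node: int    # where the controller wants the shard attached
    observed_node: int  # where the shard was last confirmed attached

# Which tenants each pageserver really has attached (hypothetical model).
attached: Dict[int, Set[str]] = {9355: {TENANT}, 9999: set()}

def proxy(node: int, tenant: str) -> int:
    """Return the HTTP status a pageserver would give for a tenant request."""
    return 200 if tenant in attached.get(node, set()) else 404

# Node 9355 was marked offline, so intent moved to 9999 (hypothetical ID),
# but the reconcile that would attach the shard there has not run yet.
shard0 = ShardState(intent_node=9999, observed_node=9355)

status_by_intent = proxy(shard0.intent_node, TENANT)
status_by_observed = proxy(shard0.observed_node, TENANT)
print(status_by_intent, status_by_observed)
```

In this model, routing by intent returns 404 while routing by the last observed location returns 200. Routing on observed state has its own trade-offs (the observed node may really be down), so this only illustrates the failure mode and is not a statement about how it should be fixed.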

@VladLazar VladLazar added c/storage Component: storage c/storage/controller Component: Storage Controller t/bug Issue Type: Bug labels Sep 19, 2024