Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gather metrics for clouflare vs ovh websockets #16034

Merged
merged 11 commits into from
Sep 19, 2024

Conversation

schlawg
Copy link
Collaborator

@schlawg schlawg commented Sep 10, 2024

  • set net.socket.test=true in application.conf, restart lila
  • wait 1-2 weeks
  • set net.socket.test=false in application.conf, restart lila
  • wait 1-2 weeks
  • mongodump -d lichess -c socket_test -o ./for-schlawg
  • zip for-schlawg and give to me.
  • db.socket_test.drop()
  • delete test code

@ornicar
Copy link
Collaborator

ornicar commented Sep 11, 2024

I think the PR will be easier to review after master, including #16019, is merged into it

@schlawg
Copy link
Collaborator Author

schlawg commented Sep 11, 2024

It will also be easier to review once I undo some moving around of stuff in socket.ts while editing. I wasn't thinking about what that would do to the diff.


def isUserInTestBucket(net: NetConfig)(using ctx: Context) =
net.socketTest && ctx.pref.usingAltSocket.isEmpty && ctx.userId.exists: userId =>
java.security.MessageDigest.getInstance("MD5").digest(userId.value.getBytes).head % 128 == 0
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about ctx.me.exists(_.createdAt.getNano % 128 == 0), that looks cheaper

Copy link
Collaborator Author

@schlawg schlawg Sep 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as long as there is a lila endpoint or dev cli command to check manually (for support). i'm not super attached and saving cycles spent repeatedly xoring numbers seems good, but on the other hand inclusion in the md5 test bucket can be determined with no reliance on prod mongodb.

Copy link
Collaborator

@ornicar ornicar Sep 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's always username.toList.map(_.toInt).sum % 128 and its shorter similar version username.toList.sum % 128

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems fine. i'm not sure we need the avalanche effect of a strong hash. even if the modulo sum of ascii characters results in uneven buckets with weird correlations, it's not a huge deal.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just don't tell niklas or glbert.

@ornicar
Copy link
Collaborator

ornicar commented Sep 11, 2024

scary changes on socket.ts 👀 Did you try out the affected use cases?

@schlawg
Copy link
Collaborator Author

schlawg commented Sep 11, 2024

yes, on iOS safari, macOS safari, linux firefox, linux chrome, and android via browserstack samsung galaxy 20ish (?)

from a behavior standpoint, the main difference is we avoid a bunch of scary logs on unexpected connection closures when the browser task queue looks like it was suspended. a false positive (which i never saw) is also no big deal.

this will reduce the number of spurious error reports from suspensions on mobile devices in the collected metrics by roughly 90%. it's not perfect as it's still possible to fail detecting a mobile app suspension when the task queue is suspended from 8.5 to 9.5 seconds after an unacknowledged ping, but this should be good enough.

i played with every type of page lifecycle event imaginable and got the most consistent behavior across all browsers with the task queue monitor approach.

the other functional changes are mostly just to ensure cf and ovh are properly rotated in all scenarios.

@schlawg
Copy link
Collaborator Author

schlawg commented Sep 11, 2024

that said, maybe niklas should have a look

@@ -18,11 +18,10 @@ interface MsgOut extends MsgBase {}
interface MsgAck extends MsgOut {
at: number;
}
type Send = (t: Tpe, d: Payload, o?: any) => void;
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this isn't used anywhere, so the type is described directly where it's used below in Ackable


const isOnline = () => !('onLine' in navigator) || navigator.onLine;

// versioned events, acks, retries, resync
export default class StrongSocket {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • tie public method typings to the implementation (via extending SocketI which is defined in index.d.ts)
  • make non-public members private (more as an annotation than access restriction, as they are not available in SocketI anyway)

tryOtherUrl = false;
autoReconnect = true;
nbConnects = 0;
storage: LichessStorage = storage.make(
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fix broken this.storage assignment which never actually created 'surl-alt' because socket-alternates is not a valid data attribute, we have been using socket-alts

private _sign?: string;
private resendWhenOpen: [string, any, any][] = [];
private baseUrls = (document.body.dataset.socketAlts || document.body.dataset.socketDomains!).split(',');
Copy link
Collaborator Author

@schlawg schlawg Sep 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we always get our socket domains in the same attribute now


constructor(
readonly url: string,
version: number | false,
settings: Partial<Settings> = {},
) {
this.options = {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i moved initialization of the various options, settings, params to the constructor as they are all instance related, whether or not they're hard coded or set by constructor arguments

@@ -90,49 +93,44 @@ export default class StrongSocket {
...(settings.params || {}),
},
};
const customPingDelay = storedIntProp('socket.ping.interval', 2500)();
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

customPingDelay was never used

this.version = version;
site.pubsub.on('socket.send', this.send);
this.connect();
this.flushStats();
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

flushStats is related to the socket test.

so is the 'pagehide' handler. these are discussed below.

cl.remove('offline');
cl.add('online');
cl.toggle('reconnected', this.nbConnects > 1);
this.onSuccess();
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some non-critical stuff shuffled around here, we could actually move all of it to onSuccess, or onOpen if we wanted to maintain uniformity with the websocket event

};

scheduleConnect = (delay: number): void => {
private scheduleConnect = (delay: number = this.options.pongTimeout): void => {
Copy link
Collaborator Author

@schlawg schlawg Sep 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this.pongTimeout is our nominal connect interval, so it's the default value for this call

// if this was set earlier, we've already logged the error
this.tryOtherUrl = true;
log(`sri ${this.settings.params!.sri} timeout ${delay}ms, trying ${this.baseUrl()}${this.url}`);
}
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the socket permalogs are noisy and mostly serve to obscure some of the more interesting uncaught exceptions and unhandled promise rejections we get from lichess.org/#debug

ws.onerror = ws.onclose = ws.onopen = ws.onmessage = () => {};
ws.close();
}
};

onError = (e: unknown): void => {
private onError = (e: unknown): void => {
if (this.heartbeat.wasSuspended) return;
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we ignore errors when they are the expected consequence of task queue suspension

site.pubsub.emit('socket.close');
if (this.autoReconnect) {

if (this.heartbeat.wasSuspended) return this.onSuspended();
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in order to have valid data, it is critical that we can distinguish task queue suspensions from real network/application errors

site.pubsub.emit('socket.close');
if (this.autoReconnect) {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this.autoReconnect is actually redundant with this.ws having a value

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about that. disconnect sets autoReconnect to false to prevent reconnection. site.reload calls disconnect to prevent websocket activity while the page is reloading. Sometimes the reload is slow and we don't want the websocket to reconnect while it happens.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

site.reload calls destroy now. disconnect has been removed from the interface

this.debug('Will autoreconnect in ' + this.options.autoReconnectDelay);
this.scheduleConnect(this.options.autoReconnectDelay);
}
if (e.wasClean && e.code < 1002) return;

if (isOnline())
log(`${site?.sri ? 'sri ' + site.sri : ''} unclean close ${e.code} ${url} ${e.reason}`);
this.tryOtherUrl = true;
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see above about socket permalogs

clearTimeout(this.pingSchedule);
};

onSuccess = (): void => {
this.nbConnects++;
Copy link
Collaborator Author

@schlawg schlawg Sep 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this.nbConnects and the related check are redundant with the one time event 'hasConnected'

clearTimeout(this.pingSchedule);
clearTimeout(this.connectSchedule);
this.storeStats({ event: 'suspend' }).then(this.connect);
}
Copy link
Collaborator Author

@schlawg schlawg Sep 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this method is called when a previously loaded page that was suspended becomes active again. we arrive here from close events that are generated for the websocket by the browser on activation.

@@ -350,14 +352,66 @@ export default class StrongSocket {

pingInterval = (): number => this.computePingDelay() + this.averageLag;
getVersion = (): number | false => this.version;

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these three methods are the bulk of the socket test code. we have three phases.

  1. Track lag stats in memory with updateStats

  2. On orderly connection close (rare) or 'pagehide' (most of our terminations), we call storeStats which dumps the current in-memory stats to local storage. it also calls flushStats but this is only relevant for the orderly close case because that method is async and task & microtask scheduling are no longer available in a pagehide event.

  3. On page loads, we call flushStats which checks local storage for stats from the previous session and sends them to idb (which is async). if the test has ended and there are values in local storage or idb, they are posted in one big batch to the collection endpoint and cleaned up on success.

@@ -168,3 +168,32 @@ export function debounce<T extends (...args: any) => void>(
}, wait);
};
}

export function browserTaskQueueMonitor(interval = 1000): { wasSuspended: boolean; reset: () => void } {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the logic for determining a task queue suspension occurred. false positives are possible in extreme scheduling scenarios, but in the socket case they just result in a rare misclassification of event types in the gathered data.

@schlawg
Copy link
Collaborator Author

schlawg commented Sep 17, 2024

any other questions about the code in this?

@ornicar
Copy link
Collaborator

ornicar commented Sep 18, 2024

yes, #16034 (comment)

@ornicar ornicar merged commit 2d7ff82 into lichess-org:master Sep 19, 2024
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants