-
-
Notifications
You must be signed in to change notification settings - Fork 2.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
gather metrics for clouflare vs ovh websockets #16034
Conversation
I think the PR will be easier to review after master, including #16019, is merged into it |
It will also be easier to review once I undo some moving around of stuff in socket.ts while editing. I wasn't thinking about what that would do to the diff. |
|
||
def isUserInTestBucket(net: NetConfig)(using ctx: Context) = | ||
net.socketTest && ctx.pref.usingAltSocket.isEmpty && ctx.userId.exists: userId => | ||
java.security.MessageDigest.getInstance("MD5").digest(userId.value.getBytes).head % 128 == 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about ctx.me.exists(_.createdAt.getNano % 128 == 0)
, that looks cheaper
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as long as there is a lila endpoint or dev cli command to check manually (for support). i'm not super attached and saving cycles spent repeatedly xoring numbers seems good, but on the other hand inclusion in the md5 test bucket can be determined with no reliance on prod mongodb.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there's always username.toList.map(_.toInt).sum % 128
and its shorter similar version username.toList.sum % 128
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems fine. i'm not sure we need the avalanche effect of a strong hash. even if the modulo sum of ascii characters results in uneven buckets with weird correlations, it's not a huge deal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just don't tell niklas or glbert.
scary changes on socket.ts 👀 Did you try out the affected use cases? |
yes, on iOS safari, macOS safari, linux firefox, linux chrome, and android via browserstack samsung galaxy 20ish (?) from a behavior standpoint, the main difference is we avoid a bunch of scary logs on unexpected connection closures when the browser task queue looks like it was suspended. a false positive (which i never saw) is also no big deal. this will reduce the number of spurious error reports from suspensions on mobile devices in the collected metrics by roughly 90%. it's not perfect as it's still possible to fail detecting a mobile app suspension when the task queue is suspended from 8.5 to 9.5 seconds after an unacknowledged ping, but this should be good enough. i played with every type of page lifecycle event imaginable and got the most consistent behavior across all browsers with the task queue monitor approach. the other functional changes are mostly just to ensure cf and ovh are properly rotated in all scenarios. |
that said, maybe niklas should have a look |
@@ -18,11 +18,10 @@ interface MsgOut extends MsgBase {} | |||
interface MsgAck extends MsgOut { | |||
at: number; | |||
} | |||
type Send = (t: Tpe, d: Payload, o?: any) => void; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this isn't used anywhere, so the type is described directly where it's used below in Ackable
|
||
const isOnline = () => !('onLine' in navigator) || navigator.onLine; | ||
|
||
// versioned events, acks, retries, resync | ||
export default class StrongSocket { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- tie public method typings to the implementation (via extending
SocketI
which is defined inindex.d.ts
) - make non-public members private (more as an annotation than access restriction, as they are not available in
SocketI
anyway)
tryOtherUrl = false; | ||
autoReconnect = true; | ||
nbConnects = 0; | ||
storage: LichessStorage = storage.make( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fix broken this.storage
assignment which never actually created 'surl-alt' because socket-alternates
is not a valid data attribute, we have been using socket-alts
private _sign?: string; | ||
private resendWhenOpen: [string, any, any][] = []; | ||
private baseUrls = (document.body.dataset.socketAlts || document.body.dataset.socketDomains!).split(','); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we always get our socket domains in the same attribute now
|
||
constructor( | ||
readonly url: string, | ||
version: number | false, | ||
settings: Partial<Settings> = {}, | ||
) { | ||
this.options = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i moved initialization of the various options, settings, params to the constructor as they are all instance related, whether or not they're hard coded or set by constructor arguments
@@ -90,49 +93,44 @@ export default class StrongSocket { | |||
...(settings.params || {}), | |||
}, | |||
}; | |||
const customPingDelay = storedIntProp('socket.ping.interval', 2500)(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
customPingDelay
was never used
this.version = version; | ||
site.pubsub.on('socket.send', this.send); | ||
this.connect(); | ||
this.flushStats(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
flushStats
is related to the socket test.
so is the 'pagehide'
handler. these are discussed below.
cl.remove('offline'); | ||
cl.add('online'); | ||
cl.toggle('reconnected', this.nbConnects > 1); | ||
this.onSuccess(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some non-critical stuff shuffled around here, we could actually move all of it to onSuccess, or onOpen if we wanted to maintain uniformity with the websocket event
}; | ||
|
||
scheduleConnect = (delay: number): void => { | ||
private scheduleConnect = (delay: number = this.options.pongTimeout): void => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this.pongTimeout
is our nominal connect interval, so it's the default value for this call
// if this was set earlier, we've already logged the error | ||
this.tryOtherUrl = true; | ||
log(`sri ${this.settings.params!.sri} timeout ${delay}ms, trying ${this.baseUrl()}${this.url}`); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the socket permalogs are noisy and mostly serve to obscure some of the more interesting uncaught exceptions and unhandled promise rejections we get from lichess.org/#debug
ws.onerror = ws.onclose = ws.onopen = ws.onmessage = () => {}; | ||
ws.close(); | ||
} | ||
}; | ||
|
||
onError = (e: unknown): void => { | ||
private onError = (e: unknown): void => { | ||
if (this.heartbeat.wasSuspended) return; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we ignore errors when they are the expected consequence of task queue suspension
site.pubsub.emit('socket.close'); | ||
if (this.autoReconnect) { | ||
|
||
if (this.heartbeat.wasSuspended) return this.onSuspended(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in order to have valid data, it is critical that we can distinguish task queue suspensions from real network/application errors
site.pubsub.emit('socket.close'); | ||
if (this.autoReconnect) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this.autoReconnect
is actually redundant with this.ws
having a value
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure about that. disconnect
sets autoReconnect
to false to prevent reconnection. site.reload
calls disconnect
to prevent websocket activity while the page is reloading. Sometimes the reload is slow and we don't want the websocket to reconnect while it happens.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
site.reload calls destroy now. disconnect has been removed from the interface
this.debug('Will autoreconnect in ' + this.options.autoReconnectDelay); | ||
this.scheduleConnect(this.options.autoReconnectDelay); | ||
} | ||
if (e.wasClean && e.code < 1002) return; | ||
|
||
if (isOnline()) | ||
log(`${site?.sri ? 'sri ' + site.sri : ''} unclean close ${e.code} ${url} ${e.reason}`); | ||
this.tryOtherUrl = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see above about socket permalogs
clearTimeout(this.pingSchedule); | ||
}; | ||
|
||
onSuccess = (): void => { | ||
this.nbConnects++; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this.nbConnects
and the related check are redundant with the one time event 'hasConnected'
clearTimeout(this.pingSchedule); | ||
clearTimeout(this.connectSchedule); | ||
this.storeStats({ event: 'suspend' }).then(this.connect); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this method is called when a previously loaded page that was suspended becomes active again. we arrive here from close
events that are generated for the websocket by the browser on activation.
@@ -350,14 +352,66 @@ export default class StrongSocket { | |||
|
|||
pingInterval = (): number => this.computePingDelay() + this.averageLag; | |||
getVersion = (): number | false => this.version; | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these three methods are the bulk of the socket test code. we have three phases.
-
Track lag stats in memory with
updateStats
-
On orderly connection close (rare) or 'pagehide' (most of our terminations), we call
storeStats
which dumps the current in-memory stats to local storage. it also callsflushStats
but this is only relevant for the orderly close case because that method is async and task & microtask scheduling are no longer available in a pagehide event. -
On page loads, we call
flushStats
which checks local storage for stats from the previous session and sends them to idb (which is async). if the test has ended and there are values in local storage or idb, they are posted in one big batch to the collection endpoint and cleaned up on success.
@@ -168,3 +168,32 @@ export function debounce<T extends (...args: any) => void>( | |||
}, wait); | |||
}; | |||
} | |||
|
|||
export function browserTaskQueueMonitor(interval = 1000): { wasSuspended: boolean; reset: () => void } { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is the logic for determining a task queue suspension occurred. false positives are possible in extreme scheduling scenarios, but in the socket case they just result in a rare misclassification of event types in the gathered data.
any other questions about the code in this? |
yes, #16034 (comment) |
net.socket.test=true
in application.conf, restart lilanet.socket.test=false
in application.conf, restart lilamongodump -d lichess -c socket_test -o ./for-schlawg
for-schlawg
and give to me.db.socket_test.drop()