Final batch of changes for 8.0 GA release #1030

PingXie · 2024-09-13T20:20:20Z

No description provided.

Add help wording for cluster SLOT-STATS. Signed-off-by: Kyle Kim <[email protected]> Signed-off-by: Ping Xie <[email protected]>

…valkey-io#822) The cluster configuration file is the metadata "database" for the cluster. It is best to trigger a save when shutdown the server, to avoid inconsistent content that is not refreshed. We save the nodes.conf whenever something that affects the nodes.conf has changed. But we are saving nodes.conf in clusterBeforeSleep, and some events may save it without a fsync, there is a time gap. And shutdown has its own save seems good to me, it doesn't need to care about the others. At the same time, a comment is added in unlock nodes.conf to explain why we actively unlock when shutdown. Signed-off-by: Binbin <[email protected]> Signed-off-by: Ping Xie <[email protected]>

Print the full client info by using catClientInfoString, the info is useful when we want to identify the source of request. Signed-off-by: Binbin <[email protected]> Signed-off-by: Ping Xie <[email protected]>

…updated in `clusterProcessGossipSection` (valkey-io#965) `clusterProcessGossipSection` currently doesn't trigger a check and call `replicationSetPrimary` when `myself`'s primary node’s IP/port is updated. This fix ensures that after every node address update, `replicationSetPrimary` is called if the updated node is `myself`'s primary. This prevents missed updates and ensures that replicas reconnect properly to maintain their replication link with the primary. Signed-off-by: Ping Xie <[email protected]>

Minor cleanup, introduced in valkey-io#877. Signed-off-by: Binbin <[email protected]> Signed-off-by: Ping Xie <[email protected]>

…ation (valkey-io#989) Prior to comparing the replica buffer against the configured limit, we need to ensure that the limit configuration is enabled. If the limit is set to zero, it indicates that there is no limit, and we should skip the buffer limit check. --------- Signed-off-by: naglera <[email protected]> Signed-off-by: Ping Xie <[email protected]>

valkey-io#886) If we modify aof-use-rdb-preamble in the middle of rewrite, we may get a wrong aof base suffix. This is because the suffix is concatenated by the main process afterwards, and it may be different from the beginning. We cache this value when we start the rewrite. Signed-off-by: Binbin <[email protected]> Signed-off-by: Ping Xie <[email protected]>

In RdbLoad, we disable AOF before emptyData and rdbLoad to prevent copy-on-write issues. After rdbLoad completes, AOF should be re-enabled, but the code incorrectly checks server.aof_state, which has been reset to AOF_OFF in stopAppendOnly. This leads to AOF not being re-enabled after being disabled. --------- Signed-off-by: Binbin <[email protected]> Signed-off-by: Ping Xie <[email protected]>

Signed-off-by: zhaozhao.zz <[email protected]> Signed-off-by: Ping Xie <[email protected]>

Fix for valkey-io#997 Root Cause Analysis: 1. Two different jobs (READ and WRITE) may be sent to the same IO thread. 2. When processing the read job in `processIOThreadsReadDone`, the IO thread may find that the write job has also been completed. 3. In this case, the IO thread calls `processClientIOWriteDone` to first process the completed write job and free the COBs https://github.com/valkey-io/valkey/blob/affbea5dc1ae1a0d80019c4f313d2bf1c3fcb7f9/src/networking.c#L4666 4. If there are pending writes (resulting from pipeline commands), a new async IO write job is sent before processing the completed read job https://github.com/valkey-io/valkey/blob/affbea5dc1ae1a0d80019c4f313d2bf1c3fcb7f9/src/networking.c#L2417 When sending the write job, the `TLS_CONN_FLAG_POSTPONE_UPDATE_STATE` flag is set to prevent the IO thread from updating the event loop, which is not thread-safe. 5. Upon resuming the read job processing, the flag is cleared, https://github.com/valkey-io/valkey/blob/affbea5dc1ae1a0d80019c4f313d2bf1c3fcb7f9/src/networking.c#L4685 causing the IO thread to update the event loop. Fix: Prevent sending async write job for pending writes when a read job is about to be processed. Testing: The issue could not be reproduced due to its rare occurrence, which requires multiple specific conditions to align simultaneously. Signed-off-by: Uri Yagelnik <[email protected]> Signed-off-by: Ping Xie <[email protected]>

Maybe partially resolves valkey-io#952. The hostnames test relies on an assumption that node zero and node six don't communicate with each other to test a bunch of behavior in the handshake stake. This was done by previously dropping all meet packets, however it seems like there was some case where node zero was sending a single pong message to node 6, which was partially initializing the state. I couldn't track down why this happened, but I adjusted the test to simply pause node zero which also correctly emulates the state we want to be in since we're just testing state on node 6, and removes the chance of errant messages. The test was failing about 5% of the time locally, and I wasn't able to reproduce a failure with this new configuration. --------- Signed-off-by: Madelyn Olson <[email protected]> Signed-off-by: Ping Xie <[email protected]>

…valkey-io#995) While doing some profiling, I noticed that getKeySlot() was a fairly large part (~0.7%) of samples doing perf with high pipeline during standalone. I think this is because we do a very late check for server.cluster_mode, we first call getKeySlot() and then call calculateKeySlot(). (calculateKeySlot was surprisingly not automatically inlined, we were doing a jump into it and then immediately returning zero). We then also do useless work in the form of caching zero in client->slot, which will further mess with cache lines. So, this PR tries to accomplish a few things things. 1) The usage of the `slot` name made a lot more sense before the introduction of the kvstore. Now with kvstore, we call this the database index, so all the references to slot in standalone are no longer really accurate. 2) Pull the cluster mode check all the way out of getKeySlot(), so hopefully a bit more performant. 3) Remove calculateKeySlot() as independent from getKeySlot(). calculateKeySlot used to have 3 call sites outside of db.c, which warranted it's own function. It's now only called in two places, pubsub.c and networking.c. I ran some profiling, and saw about ~0.3% improvement, but don't really trust it because you'll see a much higher (~2%) variance in test runs just by how the branch predictions will get changed with a new memory layout. Running perf again showed no samples in getKeySlot() and a reduction in samples in lookupKey(), so maybe this will help a little bit. --------- Signed-off-by: Madelyn Olson <[email protected]> Signed-off-by: Ping Xie <[email protected]>

…lkey-io#1025) Update client help output message for new command: client capa redirect --------- Signed-off-by: hwware <[email protected]> Signed-off-by: Binbin <[email protected]> Co-authored-by: Binbin <[email protected]> Co-authored-by: Madelyn Olson <[email protected]> Signed-off-by: Ping Xie <[email protected]>

The node may not be able to initiate an election in time due to problems with cluster communication. If an election is initiated, make sure its offset is 0. Closes valkey-io#967. Signed-off-by: Binbin <[email protected]> Signed-off-by: Ping Xie <[email protected]>

PingXie · 2024-09-13T20:26:52Z

@valkey-io/core-team PTAL.

Will prepare the release notes after merging this PR.

codecov · 2024-09-13T20:41:07Z

Codecov Report

Attention: Patch coverage is 81.48148% with 15 lines in your changes missing coverage. Please review.

Project coverage is 70.58%. Comparing base (85a5847) to head (89ec1d0).
Report is 33 commits behind head on 8.0.

Files with missing lines	Patch %	Lines
src/networking.c	14.28%	6 Missing ⚠️
src/cluster_legacy.c	85.00%	3 Missing ⚠️
src/debug.c	0.00%	3 Missing ⚠️
src/server.c	81.81%	2 Missing ⚠️
src/module.c	0.00%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##              8.0    #1030      +/-   ##
==========================================
- Coverage   70.61%   70.58%   -0.03%     
==========================================
  Files         114      114              
  Lines       61640    61665      +25     
==========================================
+ Hits        43528    43529       +1     
- Misses      18112    18136      +24

Files with missing lines	Coverage Δ
src/aof.c	`80.07% <100.00%> (-0.06%)`	⬇️
src/cluster.c	`88.38% <ø> (ø)`
src/db.c	`88.42% <100.00%> (+<0.01%)`	⬆️
src/kvstore.c	`96.14% <100.00%> (+0.03%)`	⬆️
src/pubsub.c	`96.36% <100.00%> (+0.01%)`	⬆️
src/replication.c	`87.08% <100.00%> (+0.17%)`	⬆️
src/server.h	`100.00% <ø> (ø)`
src/module.c	`9.64% <0.00%> (ø)`
src/server.c	`88.63% <81.81%> (+0.04%)`	⬆️
src/cluster_legacy.c	`85.89% <85.00%> (-0.23%)`	⬇️
... and 2 more

... and 10 files with indirect coverage changes

PingXie · 2024-09-13T21:51:01Z

Failures are a known issue #970 and it is fixed by #981.

… in advance (valkey-io#981) Fix timing issue in evaluating `cluster-allow-replica-migration` for replicas There is a timing bug where the primary and replica have different `cluster-allow-replica-migration` settings. In issue valkey-io#970, we found that if the replica receives `CLUSTER SETSLOT` before the gossip update, it remains in the original shard. This happens because we only process the `cluster-allow-replica-migration` flag for primaries during `CLUSTER SETSLOT`. This commit fixes the issue by also evaluating this flag for replicas in the `CLUSTER SETSLOT` path, ensuring correct replica migration behavior. Closes valkey-io#970 --------- Signed-off-by: Binbin <[email protected]> Co-authored-by: Ping Xie <[email protected]>

kyle-yh-kim and others added 14 commits September 13, 2024 13:25

Add SLOT-STATS under CLUSTER HELP string. (valkey-io#988)

249777f

Add help wording for cluster SLOT-STATS. Signed-off-by: Kyle Kim <[email protected]> Signed-off-by: Ping Xie <[email protected]>

Add client info to SHUTDOWN / CLUSTER FAILOVER logs (valkey-io#875)

680c9cc

Print the full client info by using catClientInfoString, the info is useful when we want to identify the source of request. Signed-off-by: Binbin <[email protected]> Signed-off-by: Ping Xie <[email protected]>

Add newline to argv in crash report when doing redact (valkey-io#993)

d949c02

Minor cleanup, introduced in valkey-io#877. Signed-off-by: Binbin <[email protected]> Signed-off-by: Ping Xie <[email protected]>

add assertion for kvstore's dictType (valkey-io#1004)

507aa39

Signed-off-by: zhaozhao.zz <[email protected]> Signed-off-by: Ping Xie <[email protected]>

PingXie force-pushed the 80ga branch from c464ec7 to 8c96da6 Compare September 13, 2024 20:26

PingXie added the run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP) label Sep 13, 2024

zuiderkwast mentioned this pull request Sep 14, 2024

8.0 GA Release #1031

Merged

enjoy-binbin closed this Sep 16, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Final batch of changes for 8.0 GA release #1030

Final batch of changes for 8.0 GA release #1030

PingXie commented Sep 13, 2024

PingXie commented Sep 13, 2024 •

edited

Loading

codecov bot commented Sep 13, 2024 •

edited

Loading

PingXie commented Sep 13, 2024

Final batch of changes for 8.0 GA release #1030

Final batch of changes for 8.0 GA release #1030

Conversation

PingXie commented Sep 13, 2024

PingXie commented Sep 13, 2024 • edited Loading

codecov bot commented Sep 13, 2024 • edited Loading

Codecov Report

PingXie commented Sep 13, 2024

PingXie commented Sep 13, 2024 •

edited

Loading

codecov bot commented Sep 13, 2024 •

edited

Loading