Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Final batch of changes for 8.0 GA release #1030

Closed
wants to merge 15 commits into from
Closed

Conversation

PingXie
Copy link
Member

@PingXie PingXie commented Sep 13, 2024

No description provided.

kyle-yh-kim and others added 14 commits September 13, 2024 13:25
Add help wording for cluster SLOT-STATS.

Signed-off-by: Kyle Kim <[email protected]>
Signed-off-by: Ping Xie <[email protected]>
…valkey-io#822)

The cluster configuration file is the metadata "database" for the
cluster. It is best to trigger a save when shutdown the server, to
avoid inconsistent content that is not refreshed.

We save the nodes.conf whenever something that affects the nodes.conf
has changed. But we are saving nodes.conf in clusterBeforeSleep, and
some events may save it without a fsync, there is a time gap.

And shutdown has its own save seems good to me, it doesn't need to
care about the others.

At the same time, a comment is added in unlock nodes.conf to explain
why we actively unlock when shutdown.

Signed-off-by: Binbin <[email protected]>
Signed-off-by: Ping Xie <[email protected]>
Print the full client info by using catClientInfoString, the
info is useful when we want to identify the source of request.

Signed-off-by: Binbin <[email protected]>
Signed-off-by: Ping Xie <[email protected]>
…updated in `clusterProcessGossipSection` (valkey-io#965)

`clusterProcessGossipSection` currently doesn't trigger a check and call `replicationSetPrimary` when `myself`'s primary node’s IP/port is updated. This fix ensures that after every node address update, `replicationSetPrimary` is called if the updated node is `myself`'s primary. This prevents missed updates and ensures that replicas reconnect properly to maintain their replication link with the primary.

Signed-off-by: Ping Xie <[email protected]>
Minor cleanup, introduced in valkey-io#877.

Signed-off-by: Binbin <[email protected]>
Signed-off-by: Ping Xie <[email protected]>
…ation (valkey-io#989)

Prior to comparing the replica buffer against the configured limit, we
need to ensure that the limit configuration is enabled. If the limit is
set to zero, it indicates that there is no limit, and we should skip the
buffer limit check.

---------

Signed-off-by: naglera <[email protected]>
Signed-off-by: Ping Xie <[email protected]>
valkey-io#886)

If we modify aof-use-rdb-preamble in the middle of rewrite,
we may get a wrong aof base suffix. This is because the suffix
is concatenated by the main process afterwards, and it may be
different from the beginning.

We cache this value when we start the rewrite.

Signed-off-by: Binbin <[email protected]>
Signed-off-by: Ping Xie <[email protected]>
In RdbLoad, we disable AOF before emptyData and rdbLoad to prevent copy-on-write issues. After rdbLoad completes, AOF should be re-enabled, but the code incorrectly checks server.aof_state, which has been reset to AOF_OFF in stopAppendOnly. This leads to AOF not being re-enabled after being disabled.
---------

Signed-off-by: Binbin <[email protected]>
Signed-off-by: Ping Xie <[email protected]>
Fix for valkey-io#997

Root Cause Analysis:
1. Two different jobs (READ and WRITE) may be sent to the same IO
thread.
2. When processing the read job in `processIOThreadsReadDone`, the IO
thread may find that the write job has also been completed.
3. In this case, the IO thread calls `processClientIOWriteDone` to first
process the completed write job and free the COBs
https://github.com/valkey-io/valkey/blob/affbea5dc1ae1a0d80019c4f313d2bf1c3fcb7f9/src/networking.c#L4666
4. If there are pending writes (resulting from pipeline commands), a new
async IO write job is sent before processing the completed read job
https://github.com/valkey-io/valkey/blob/affbea5dc1ae1a0d80019c4f313d2bf1c3fcb7f9/src/networking.c#L2417
When sending the write job, the `TLS_CONN_FLAG_POSTPONE_UPDATE_STATE`
flag is set to prevent the IO thread from updating the event loop, which
is not thread-safe.
5. Upon resuming the read job processing, the flag is cleared,
https://github.com/valkey-io/valkey/blob/affbea5dc1ae1a0d80019c4f313d2bf1c3fcb7f9/src/networking.c#L4685
causing the IO thread to update the event loop.

Fix:
Prevent sending async write job for pending writes when a read job is
about to be processed.

Testing:
The issue could not be reproduced due to its rare occurrence, which
requires multiple specific conditions to align simultaneously.

Signed-off-by: Uri Yagelnik <[email protected]>
Signed-off-by: Ping Xie <[email protected]>
Maybe partially resolves valkey-io#952.

The hostnames test relies on an assumption that node zero and node six
don't communicate with each other to test a bunch of behavior in the
handshake stake. This was done by previously dropping all meet packets,
however it seems like there was some case where node zero was sending a
single pong message to node 6, which was partially initializing the
state.

I couldn't track down why this happened, but I adjusted the test to
simply pause node zero which also correctly emulates the state we want
to be in since we're just testing state on node 6, and removes the
chance of errant messages. The test was failing about 5% of the time
locally, and I wasn't able to reproduce a failure with this new
configuration.

---------

Signed-off-by: Madelyn Olson <[email protected]>
Signed-off-by: Ping Xie <[email protected]>
…valkey-io#995)

While doing some profiling, I noticed that getKeySlot() was a fairly
large part (~0.7%) of samples doing perf with high pipeline during
standalone. I think this is because we do a very late check for
server.cluster_mode, we first call getKeySlot() and then call
calculateKeySlot(). (calculateKeySlot was surprisingly not automatically
inlined, we were doing a jump into it and then immediately returning
zero). We then also do useless work in the form of caching zero in
client->slot, which will further mess with cache lines.

So, this PR tries to accomplish a few things things.
1) The usage of the `slot` name made a lot more sense before the
introduction of the kvstore. Now with kvstore, we call this the database
index, so all the references to slot in standalone are no longer really
accurate.
2) Pull the cluster mode check all the way out of getKeySlot(), so
hopefully a bit more performant.
3) Remove calculateKeySlot() as independent from getKeySlot().
calculateKeySlot used to have 3 call sites outside of db.c, which
warranted it's own function. It's now only called in two places,
pubsub.c and networking.c.

I ran some profiling, and saw about ~0.3% improvement, but don't really
trust it because you'll see a much higher (~2%) variance in test runs
just by how the branch predictions will get changed with a new memory
layout. Running perf again showed no samples in getKeySlot() and a
reduction in samples in lookupKey(), so maybe this will help a little
bit.

---------

Signed-off-by: Madelyn Olson <[email protected]>
Signed-off-by: Ping Xie <[email protected]>
…lkey-io#1025)

Update client help output message for new command: client capa redirect

---------

Signed-off-by: hwware <[email protected]>
Signed-off-by: Binbin <[email protected]>
Co-authored-by: Binbin <[email protected]>
Co-authored-by: Madelyn Olson <[email protected]>
Signed-off-by: Ping Xie <[email protected]>
The node may not be able to initiate an election in time due to
problems with cluster communication. If an election is initiated,
make sure its offset is 0.

Closes valkey-io#967.

Signed-off-by: Binbin <[email protected]>
Signed-off-by: Ping Xie <[email protected]>
@PingXie
Copy link
Member Author

PingXie commented Sep 13, 2024

@valkey-io/core-team PTAL.

Will prepare the release notes after merging this PR.

Copy link

codecov bot commented Sep 13, 2024

Codecov Report

Attention: Patch coverage is 81.48148% with 15 lines in your changes missing coverage. Please review.

Project coverage is 70.58%. Comparing base (85a5847) to head (89ec1d0).
Report is 33 commits behind head on 8.0.

Files with missing lines Patch % Lines
src/networking.c 14.28% 6 Missing ⚠️
src/cluster_legacy.c 85.00% 3 Missing ⚠️
src/debug.c 0.00% 3 Missing ⚠️
src/server.c 81.81% 2 Missing ⚠️
src/module.c 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##              8.0    #1030      +/-   ##
==========================================
- Coverage   70.61%   70.58%   -0.03%     
==========================================
  Files         114      114              
  Lines       61640    61665      +25     
==========================================
+ Hits        43528    43529       +1     
- Misses      18112    18136      +24     
Files with missing lines Coverage Δ
src/aof.c 80.07% <100.00%> (-0.06%) ⬇️
src/cluster.c 88.38% <ø> (ø)
src/db.c 88.42% <100.00%> (+<0.01%) ⬆️
src/kvstore.c 96.14% <100.00%> (+0.03%) ⬆️
src/pubsub.c 96.36% <100.00%> (+0.01%) ⬆️
src/replication.c 87.08% <100.00%> (+0.17%) ⬆️
src/server.h 100.00% <ø> (ø)
src/module.c 9.64% <0.00%> (ø)
src/server.c 88.63% <81.81%> (+0.04%) ⬆️
src/cluster_legacy.c 85.89% <85.00%> (-0.23%) ⬇️
... and 2 more

... and 10 files with indirect coverage changes

@PingXie
Copy link
Member Author

PingXie commented Sep 13, 2024

Failures are a known issue #970 and it is fixed by #981.

@PingXie PingXie added the run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP) label Sep 13, 2024
… in advance (valkey-io#981)

Fix timing issue in evaluating `cluster-allow-replica-migration` for replicas

There is a timing bug where the primary and replica have different 
`cluster-allow-replica-migration` settings. In issue valkey-io#970, we found that if 
the replica receives `CLUSTER SETSLOT` before the gossip update, it remains 
in the original shard. This happens because we only process the 
`cluster-allow-replica-migration` flag for primaries during `CLUSTER SETSLOT`.

This commit fixes the issue by also evaluating this flag for replicas in the 
`CLUSTER SETSLOT` path, ensuring correct replica migration behavior.

Closes valkey-io#970
---------

Signed-off-by: Binbin <[email protected]>
Co-authored-by: Ping Xie <[email protected]>
@zuiderkwast zuiderkwast mentioned this pull request Sep 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants