Enable Consistent Topology Updates¶
This article explains how to enable consistent topology changes when you upgrade from version 2024.1 to 2024.2.
Introduction¶
ScyllaDB Enterprise 2024.2 introduces consistent topology changes based on Raft. Clusters created with version 2024.2 use consistent topology changes right from the start. However, consistent topology changes are not automatically enabled in clusters upgraded from version 2024.1. In such clusters, you need to enable consistent topology changes manually by following the procedure described in this article.
Before you start, you must check that the cluster meets the prerequisites and ensure that some administrative procedures will not be run while the procedure is in progress.
Prerequisites¶
Make sure that all nodes in the cluster are upgraded to ScyllaDB Enterprise 2024.2.
Verify that schema on raft is enabled.
Make sure that all nodes have enabled the SUPPORTS_CONSISTENT_TOPOLOGY_CHANGES cluster feature. One way to verify it is to look for the following message in the log:
    features - Feature SUPPORTS_CONSISTENT_TOPOLOGY_CHANGES is enabled
Alternatively, it can be verified programmatically by checking whether the value column under the enabled_features key in the system.scylla_local table contains the name of the feature. One way to do it is with the following bash script:
    until cqlsh -e "select value from system.scylla_local where key = 'enabled_features'" | grep "SUPPORTS_CONSISTENT_TOPOLOGY_CHANGES"
    do
        echo "Upgrade didn't finish yet on the local node, waiting 10 seconds before checking again..."
        sleep 10
    done
    echo "Upgrade completed on the local node"
Make sure that all nodes are alive for the duration of the procedure.
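The feature check above runs against the local node only, while the prerequisite applies to every node. The following is a minimal sketch of repeating the same query per node. The function names has_topology_feature and check_all_nodes are hypothetical, and the sketch assumes that each address passed in accepts cqlsh connections without extra options such as credentials or TLS settings:

```bash
# Hypothetical helper: succeeds if the text on stdin names the feature.
has_topology_feature() {
    grep -q "SUPPORTS_CONSISTENT_TOPOLOGY_CHANGES"
}

# Hypothetical helper: run the enabled_features query against every
# address given as an argument. Returns non-zero if any node lacks the feature.
check_all_nodes() {
    local node status=0
    for node in "$@"; do
        if cqlsh -e "select value from system.scylla_local where key = 'enabled_features'" "$node" | has_topology_feature; then
            echo "$node: feature enabled"
        else
            echo "$node: feature NOT enabled"
            status=1
        fi
    done
    return "$status"
}
```

For example, check_all_nodes 10.0.0.1 10.0.0.2 10.0.0.3 (placeholder addresses) prints one line per node and exits with a non-zero status if any node has not enabled the feature yet.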
Administrative operations that must not be running during the procedure¶
Make sure that administrative operations will not be running while the procedure is in progress. In particular, you must abstain from:
Cluster management procedures (adding, replacing, removing, decommissioning nodes, etc.).
Running nodetool repair.
Running nodetool checkAndRepairCdcStreams.
Any modifications of authentication and authorization settings.
Any change of authorization via CQL API.
Schema changes.
Running the procedure¶
Warning
Before proceeding, make sure that all the prerequisites are met and no forbidden administrative operations will run during the procedure. Failing to do so may put the cluster in an inconsistent state.
Issue a POST HTTP request to the /storage_service/raft_topology/upgrade endpoint of any node in the cluster. For example, you can do it with curl:
    curl -X POST "http://127.0.0.1:10000/storage_service/raft_topology/upgrade"
Wait until all nodes report that the procedure is complete. You can check whether a node finished the procedure in one of two ways:
By sending an HTTP GET request to the /storage_service/raft_topology/upgrade endpoint. For example, you can do it with curl:
    curl -X GET "http://127.0.0.1:10000/storage_service/raft_topology/upgrade"
It will return a JSON string that will be equal to done after the procedure is complete on that node.
By querying the upgrade_state column in the system.topology table. You can use cqlsh to get the value of the column:
    cqlsh -e "select upgrade_state from system.topology"
The upgrade_state column should be set to done after the procedure is complete on that node.
After the procedure is complete on all nodes, wait at least one minute before issuing any topology changes in order to avoid data loss from writes that were started before the procedure.
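The waiting step can be scripted by polling the GET endpoint on each node. The following is a minimal sketch, not an official tool. The function names upgrade_state and wait_for_upgrade and the POLL_INTERVAL variable are hypothetical, and the sketch assumes that the REST API of every node is reachable on port 10000 at the address you pass in (the example above only shows it on 127.0.0.1):

```bash
# Hypothetical helper: print the upgrade state reported by one node.
# The endpoint returns a JSON string, so the surrounding quotes are stripped.
upgrade_state() {
    curl -s -X GET "http://$1:10000/storage_service/raft_topology/upgrade" | tr -d '"\n'
}

# Hypothetical helper: block until every address given as an argument
# reports "done". POLL_INTERVAL (seconds) is an assumed knob, default 10.
wait_for_upgrade() {
    local node
    for node in "$@"; do
        until [ "$(upgrade_state "$node")" = "done" ]; do
            echo "$node: not done yet, retrying in ${POLL_INTERVAL:-10} seconds..."
            sleep "${POLL_INTERVAL:-10}"
        done
        echo "$node: done"
    done
}
```

For example, wait_for_upgrade 10.0.0.1 10.0.0.2 10.0.0.3 && sleep 60 (placeholder addresses) returns only after all listed nodes report done and the one-minute waiting period has passed.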
What if the procedure gets stuck?¶
If the procedure gets stuck at some point, first check the status of your cluster:
If there are some nodes that are not alive, try to restart them.
If all nodes are alive, ensure that the network is healthy and every node can reach all other nodes.
If all nodes are alive and the network is healthy, perform a rolling restart of the cluster.
If none of the above solves the issue, perform the Raft recovery procedure. During recovery, the cluster will switch back to the gossip-based topology management mechanism.
After exiting recovery, you should retry enabling consistent topology updates using the procedure described in this document.
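To find nodes that are not alive, as the first check in the list above suggests, one option is to filter the output of nodetool status. The following is a minimal sketch with a hypothetical helper name, down_nodes. It assumes the usual output layout in which each node line starts with a two-letter state such as UN or DN, followed by the node address:

```bash
# Hypothetical helper: read `nodetool status` output on stdin and print
# the address of every node whose state starts with D (down).
down_nodes() {
    awk '$1 ~ /^D[NLJM]$/ { print $2 }'
}
```

For example, nodetool status | down_nodes prints one address per down node and prints nothing when all nodes are up.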