Basic Recovery for a Neo4j Causal Cluster running in K8s
May 29, 2022 | DevOps
Neo4j is (kind of) the only solution when a graph database is needed. And for a production-grade deployment, where resilience is required, a Neo4j Causal Cluster is a must.
The Neo4j Causal Cluster comprises two types of servers:
- Core Server – acts as read/write and forms the initial cluster. A Core Server can act as a Leader (write) or Follower (read), which varies per database.
- Read Replica – acts as read-only and is considered disposable. A Read Replica can be removed/added with no “damage” to the cluster.
In terms of resilience, the Neo4j Causal Cluster can sustain the loss of Core Servers as long as a majority of them remains available. In this case, another Core Server becomes the Leader for the databases that were led by the lost server.
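To make the resilience claim concrete, here is a minimal sketch of the majority rule used by Raft-based clusters such as the Neo4j Core Servers: a cluster of 2F + 1 Core Servers can tolerate F failures. The function name is only for illustration and is not part of any Neo4j tooling.

```python
def tolerated_core_failures(core_count: int) -> int:
    """Return how many Core Servers can be lost while a majority still remains.

    Assumption: majority (Raft-style) consensus, i.e. N = 2F + 1.
    """
    if core_count < 1:
        raise ValueError("a cluster needs at least one Core Server")
    return (core_count - 1) // 2


# Print the tolerance for a few common cluster sizes.
for cores in (3, 4, 5, 7):
    print(f"{cores} Core Servers -> tolerates {tolerated_core_failures(cores)} failure(s)")
```

Note that an even number of Core Servers does not buy extra tolerance: four servers tolerate the same single failure as three.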
Having said all that, maintaining a Neo4j Causal Cluster is not always an easy task, especially when actual data is involved and data loss is not an option. So it is very important to have a basic, standard approach for handling an issue in your Neo4j Causal Cluster. I would like to share some pointers and courses of action that I have found very useful.
Note that the first steps you take are critical and have a great effect on the overall outcome, so make sure you know exactly what you are doing and why you are doing it. Failing to do so can easily result in a more severe issue than the one that brought you to the cluster in the first place.
The first step is to understand the problem
- Go to the Neo4j Browser UI or open cypher-shell
- Run :sysinfo (a Neo4j Browser command; in cypher-shell, run SHOW DATABASES; instead)
- Check if all databases have an online status
- If any of the databases are offline or have exceptions or errors, try to understand what they mean (not always as trivial as it may sound)
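The checks above can be sketched as a small helper that scans the rows returned by SHOW DATABASES and reports anything that is not online or that carries an error. This is only an illustration: it assumes each row is a dictionary with the `name`, `address`, `currentStatus` and `error` columns (as in Neo4j 4.x), and the sample rows and addresses are made up. How you fetch the rows (driver, cypher-shell output, and so on) is up to you.

```python
from typing import Dict, Iterable, List, Optional


def find_unhealthy(rows: Iterable[Dict[str, Optional[str]]]) -> List[str]:
    """Return one description per database copy that is not online or has an error."""
    problems = []
    for row in rows:
        name = row.get("name") or "?"
        address = row.get("address") or "?"
        status = row.get("currentStatus") or "unknown"
        error = row.get("error") or ""
        # A copy is suspicious if it is not online or if it reports any error text.
        if status != "online" or error:
            problems.append(f"{name}@{address}: status={status} error={error or 'none'}")
    return problems


# Hypothetical rows, shaped like the output of SHOW DATABASES.
sample_rows = [
    {"name": "neo4j", "address": "core-0:7687", "currentStatus": "online", "error": ""},
    {"name": "neo4j", "address": "core-1:7687", "currentStatus": "offline", "error": "An error occurred"},
]

for problem in find_unhealthy(sample_rows):
    print(problem)
```

An empty result does not prove the cluster is healthy; as described next, the servers may simply be out of sync.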
If you were able to identify the problem, good for you! You should work to resolve it.
If you were not able to identify the problem (i.e., no errors or exceptions and all databases are online), fear not! It might be that your servers just went out of sync.
After you have resolved the underlying issues (or decided it's a sync issue), you should re-sync the cluster:
- Put all servers in “Offline Maintenance Mode”
- Unbind all of them from the cluster (run the unbind command, neo4j-admin unbind in Neo4j 4.x, on each server)
- Take all servers out of “Offline Maintenance Mode”
- Wait for the cluster to re-form
- Run :sysinfo and check if all databases are online
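The re-sync steps above can be sketched as an ordered command plan. This is a rough sketch under several assumptions that are not stated in this post: that each server was installed as its own release of the official Neo4j Helm chart (where `neo4j.offlineMaintenanceModeEnabled` toggles offline maintenance mode), that the chart is `neo4j/neo4j-cluster-core`, that each release's pod is named `<release>-0`, and that the cluster runs Neo4j 4.x, where the unbind command is `neo4j-admin unbind`. The release names are hypothetical. The code only builds and prints the commands; it does not run them.

```python
from typing import List


def resync_plan(
    releases: List[str],
    namespace: str = "default",
    chart: str = "neo4j/neo4j-cluster-core",  # assumption: official Helm chart for Core Servers
) -> List[List[str]]:
    """Build the commands for: maintenance mode on, unbind, maintenance mode off."""

    def maintenance(release: str, enabled: bool) -> List[str]:
        flag = "true" if enabled else "false"
        return [
            "helm", "upgrade", release, chart,
            "--namespace", namespace,
            "--reuse-values",
            "--set", f"neo4j.offlineMaintenanceModeEnabled={flag}",
        ]

    def unbind(release: str) -> List[str]:
        pod = f"{release}-0"  # assumption: StatefulSet pod naming used by the chart
        return ["kubectl", "exec", "--namespace", namespace, pod, "--", "neo4j-admin", "unbind"]

    plan = [maintenance(r, True) for r in releases]    # 1. all servers into maintenance mode
    plan += [unbind(r) for r in releases]              # 2. unbind every server
    plan += [maintenance(r, False) for r in releases]  # 3. all servers out of maintenance mode
    return plan


# Hypothetical release names; review each command before running it by hand.
for command in resync_plan(["core-1", "core-2", "core-3"]):
    print(" ".join(command))
```

Printing the plan instead of executing it is deliberate: each phase should finish on every server before the next one starts, and it is worth reading each command before you run it.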