Category: DevOps

Troubleshooting The Right Way

June 2, 2022 | Automation, DevOps | No Comments

How fun is it to be the first one who comes up with the solution! Such a great feeling knowing that you were the one who solved the problem. However, it’s easy to rush and provide a fix for the symptom rather than the actual problem. I guess we all do that from time to time.

Most of the time the actual problem hides behind the symptom. When we focus on providing a fix for the symptom, not only might we miss the problem entirely (and not fix it at all), we could also increase its severity or hide it even further.

Here are some tips that help me (every day) come up with solutions for the actual problem and not for the symptom:

  1. Try to describe and characterize the problem, not the symptom. Ask yourself what could be causing the symptom you see. List all the possible sources you can come up with, and focus on gathering information and describing the issue at hand.
  2. Focus on the “Story”. Ignore the visible issue and focus on the “story”: the full (end-to-end) process that is currently not working. A good sign that you got the story right is your ability to tell it to someone else and make them understand it.
  3. Start by assuming you don’t know. Assuming you don’t know forces you to investigate more thoroughly and to make sure you understand what you find.
  4. Ask yourself why. Why did a certain outcome come to be? Why did ‘X’ happen? What could have caused ‘X’ to occur?
  5. Zoom out. After gathering some information, zoom out and arrange all the facts you have into the Story mentioned in tip (2). Keep it short and clear. This way, if you need assistance and have to bring someone new up to speed, it will be quick and easy.
  6. Don’t be afraid to ask questions. If you have questions, find the relevant person and ask, regardless of how sophisticated (or not) you think your question is. There is no such thing as a stupid question when you are investigating an issue. You can always throw the word “Production” in. It will definitely help 😁.
  7. Avoid theological and theoretical discussions. Do your best to avoid theoretical discussions around how a process or flow should work and what the best practice for it would be. As important as these discussions are, they are irrelevant at that point in time and will only waste your precious time. However, make sure these discussions are held right after the issue is resolved.
  8. Verify and validate your suspicions and conclusions before you act on them. Do not settle for thinking alone; make an effort to confirm that your suspicions are supported by evidence.
  9. Focus on the facts and the data. Use all the data and facts at hand when constructing your solution. No fact or data point should be ignored, as ignoring one might steer you away from the required solution.

Have other useful tips? Let me know…

Kubernetes Cheat Sheet Shortlist

June 1, 2022 | DevOps | No Comments

Context & Namespaces


List All Contexts

Returns the list of contexts from the local kubectl config file

kubectl config get-contexts

Show Current Context

Returns the context currently being used by kubectl

kubectl config current-context


Change Current Context
kubectl config use-context CLUSTER_IDENTIFIER

Set “Sticky” Namespace

Use a “sticky” namespace so you won’t have to specify the namespace in every command

kubectl config set-context --current --namespace=NS_NAME
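For example, assuming a namespace called `dev` (a hypothetical name), the following makes it the default for every subsequent command in the current context:

```shell
# Make "dev" the default namespace for the current context
kubectl config set-context --current --namespace=dev

# From now on, this lists pods in "dev" without needing the -n flag
kubectl get pods
```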


General Commands


Port Forwarding to Pod

Port-forwarding using the same port on the local machine and the target pod

kubectl port-forward -n NS_NAME POD_NAME PORT

Port-forwarding using a different port on the local machine and the target pod

kubectl port-forward -n NS_NAME POD_NAME LOCAL_PORT:TARGET_PORT

Port-forwarding to a service
kubectl port-forward -n NS_NAME svc/SERVICE_NAME PORT
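The LOCAL_PORT:TARGET_PORT form works for services as well, which helps when the service port is already taken on your machine. A sketch with hypothetical ports:

```shell
# Forward local port 8080 to port 80 of the service
kubectl port-forward -n NS_NAME svc/SERVICE_NAME 8080:80
```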

Scaling
kubectl scale OBJECT OBJECT_NAME -n NS_NAME --replicas=DESIRED_COUNT
kubectl scale deployment DEPLOYMENT_NAME -n NS_NAME --replicas=5


Wait For…

Wait for Pod to become ready

kubectl wait -n NS_NAME --for=condition=Ready pod/POD_NAME --timeout=90s

Wait for Pod to become deleted

kubectl wait -n NS_NAME --for=delete pod/POD_NAME --timeout=90s

Run BASH Command On Pod
kubectl exec -n NS_NAME -it POD_NAME -- bash -c "YOUR_COMMANDS;"

Open a Shell on a Pod (SSH-like)
kubectl exec -n NS_NAME -it POD_NAME -- bash

Force Pod Deletion
kubectl delete pod POD_NAME -n NS_NAME --grace-period=0 --force

Get Pods With Name-Only
kubectl get pods -n NS_NAME --no-headers -o custom-columns=":metadata.name"
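The name-only output is handy for piping into other commands. For example, a sketch of deleting every pod whose name starts with a given prefix (`my-app` is a hypothetical prefix):

```shell
# List pod names only, filter by prefix, and delete the matches
kubectl get pods -n NS_NAME --no-headers -o custom-columns=":metadata.name" \
  | grep '^my-app' \
  | xargs kubectl delete pod -n NS_NAME
```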

Suspend a Cronjob
kubectl patch cronjob CRONJOB_NAME -n NS_NAME -p '{"spec" : {"suspend" : true }}'
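To resume the cronjob later, apply the same patch with `suspend` set back to false:

```shell
# Resume a previously suspended cronjob
kubectl patch cronjob CRONJOB_NAME -n NS_NAME -p '{"spec" : {"suspend" : false }}'
```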

Kafka Cheat Sheet Shortlist

May 30, 2022 | DevOps | No Comments

When using Kafka, whether it’s an Operator-based deployment on Kubernetes (like Strimzi) or a plain deployment, there is usually a set of common commands required for the basic day-to-day maintenance and operation of the Kafka cluster.

As a first step, you should download the Kafka distribution package. This package holds a set of Bash scripts that run the core (Java-based) Kafka components and enable interaction with the Kafka cluster (e.g. creating topics, changing configs, etc.).

When running these Kafka commands, note that they should be executed against ZooKeeper or the bootstrap server, depending on the Kafka version you are using.

Here is a list of commands which in my opinion are the most used and most necessary for the day-to-day operation of a Kafka cluster.


List All Topics
bin/kafka-topics.sh --list --zookeeper SERVER_ADDRESS:2181
bin/kafka-topics.sh --list --bootstrap-server SERVER_ADDRESS:9092


Show Topic Details
./bin/kafka-topics.sh --describe --bootstrap-server SERVER_ADDRESS:9092 --topic TOPIC_NAME


Create a Topic
./bin/kafka-topics.sh --create --bootstrap-server SERVER_ADDRESS:9092 --topic TOPIC_NAME --partitions 2 --replication-factor 2


Change Topic Configuration Values

This command can be used to change or add configuration values on topics. This specific example shows how to change the retention.ms value.

./bin/kafka-configs.sh --alter --bootstrap-server SERVER_ADDRESS:9092 --add-config retention.ms=1800000 --entity-type topics --entity-name TOPIC_NAME
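To verify the change took effect, the same script can describe the topic’s current configuration overrides:

```shell
# Show the overridden (non-default) configuration values for the topic
./bin/kafka-configs.sh --describe --bootstrap-server SERVER_ADDRESS:9092 --entity-type topics --entity-name TOPIC_NAME
```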


Drain (purge) a Topic
./bin/kafka-configs.sh --alter --bootstrap-server SERVER_ADDRESS:9092 --add-config retention.ms=0 --entity-type topics --entity-name TOPIC_NAME
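After the topic has been purged, you will want to restore its retention. One way (a sketch) is to delete the override entirely, so the topic falls back to the broker default:

```shell
# Remove the retention.ms override; the topic reverts to the broker default
./bin/kafka-configs.sh --alter --bootstrap-server SERVER_ADDRESS:9092 --delete-config retention.ms --entity-type topics --entity-name TOPIC_NAME
```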


Run a Kafka Producer

The producer process lets us add data to the topics in our Kafka cluster. When run, it opens a producer console in which we can type data inputs to the cluster. Each line of text is added as a separate message to the relevant topic.

bin/kafka-console-producer.sh --bootstrap-server SERVER_ADDRESS:9092 --topic TOPIC_NAME


Run a Kafka Consumer

The consumer process lets us read data from the topics in our Kafka cluster. When using the --from-beginning flag, we tell the Kafka cluster to let us read all the messages currently in the topic (including old messages that were already consumed).

When running this command, all read messages are logged to the console. If executed alongside a producer, any message produced in the producer console will be logged to the console by the consumer process.

bin/kafka-console-consumer.sh --bootstrap-server SERVER_ADDRESS:9092 --topic TOPIC_NAME --from-beginning
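If you want consumption to be tracked, so a restarted consumer continues where it left off, you can join a consumer group; `MY_GROUP` here is a hypothetical group id:

```shell
# Consume as part of a consumer group, so offsets are committed
bin/kafka-console-consumer.sh --bootstrap-server SERVER_ADDRESS:9092 --topic TOPIC_NAME --group MY_GROUP
```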

When it comes to graph databases, no doubt that Neo4j is the shining star.

When it comes to Neo4j in production, no doubt that a Neo4j cluster is probably the best approach.

When it comes to a Neo4j cluster, in my humble opinion, there are some guidelines you should follow.

  1. Make sure you never run out of disk space. This might sound trivial, but recovery from a full disk can be super-painful, as it will definitely throw your cluster out of sync and may get your databases quarantined.
  2. A Causal Cluster is causal up to a certain point. When a core server is lost, even if you had enough core servers to support the loss, databases may still go offline and your cluster might go out of sync.
  3. A database might remain quarantined even after the underlying issue is resolved. Databases usually get quarantined for a reason, and usually a justified one. However, after the underlying issue is resolved, the database sometimes remains quarantined (this time with no good reason).
  4. Have enough core servers to afford losses. The formula is n = 2F + 1, where F is the number of core servers you are willing to lose and n is the total number of core servers required to sustain that loss. For example, a 3-core cluster can lose one server, while tolerating the loss of two requires 5 servers.
  5. Have a simple and efficient deployment mechanism. When it comes to clusters, especially in production, the most important and critical aspect is the day-to-day maintenance. And don’t delude yourself: you will have maintenance, and lots of it, all the time. So you must have a super-simple, standardized deployment mechanism to facilitate ad-hoc changes.
  6. Use a customized helm chart. Make sure you control the exact version running in production to avoid surprises. This is a generally pessimistic approach I recommend taking with any 3rd party you use. You can never know when something will change on the vendor’s side, or how such a change will affect your application. Better safe than sorry.
  7. Have a separate mechanism for handling volumes. Neo4j servers are deployed via StatefulSets, which do not allow volume size changes via the helm chart once deployed. So make sure you have a standardized deployment mechanism to facilitate that. Didn’t I already write that in section 5? Well, it’s important.
  8. Think very carefully before you attend to a cluster issue. The first actions you take when attending to a cluster issue may have a huge effect on the overall outcome. Read more here.

Neo4j is (kind of) the only solution when a graph database is needed. And when it comes to production grade, where resilience is required, a Neo4j Causal Cluster is a must.

The Neo4j Causal Cluster comprises two types of servers: 

  1. Core Server – acts as read/write and forms the initial cluster. A Core Server can act as a Leader (write) or Follower (read), which varies per database.
  2. Read Replica – acts as read-only and is considered disposable. A Read Replica can be removed/added with no “damage” to the cluster.

In terms of resilience, the Neo4j Causal Cluster can sustain the loss of Core Servers. In that case, another Core Server will become the leader for the databases that were led by the lost server.

Having said all that, maintaining a Neo4j Causal Cluster may not always be the easiest task, especially when actual data is involved and data loss is not an option. So it is very important to have a basic, standard approach for attending to an issue in your Neo4j Causal Cluster. I would like to share some pointers and courses of action that I have found very useful.

You should note that the first steps you take are super-critical and have a great effect on the overall outcome, so make sure you know exactly what you are doing and why you are doing it. Failing to do so may often result in a more severe issue than the one that caused you to attend to the cluster in the first place.

The first step is to understand the problem

  1. Go to the Neo4j Browser or open cypher-shell
  2. Run :sysinfo
  3. Check if all databases have an online status
  4. If any of the databases are offline or have exceptions or errors, try to understand what they mean (not always as trivial as it may sound)

If you were able to identify the problem, good for you! You should work to resolve it.

If you were not able to identify the problem (i.e., no errors or exceptions and all databases are online), fear not! It might be that your servers just went out of sync.

After you resolve the underlying issues (or decide it’s a sync issue), you should Re-Sync The Cluster

  1. Put all servers in “Offline Maintenance Mode”
  2. Unbind all of them from the cluster (run the unbind command on each server)
  3. Take all servers out of “Offline Maintenance Mode”
  4. Wait for the cluster to re-form
  5. Run :sysinfo and check if all databases are online
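The unbind steps above can be sketched as follows, assuming a plain (non-Kubernetes) Neo4j 4.x installation where the `neo4j` and `neo4j-admin` binaries are on the PATH; run this on each core server:

```shell
# Stop the server first: unbind requires the server to be stopped
neo4j stop

# Remove the server's cluster state so it can rejoin cleanly
neo4j-admin unbind

# Start the server again and let the cluster re-form
neo4j start
```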