There is little to say about the development side, since it is the same as writing code anywhere. This article is mainly about the operations and maintenance (O&M) issues I deal with, and I also recommend a good tool.
The first half of this article is a compilation of the O&M issues I encountered at work. The second half is about Termux, a small tool.
My Operation and Maintenance Process
From my point of view, many operations tasks are simply the ones we have not managed to automate. With better infrastructure and better, more automated code, the Ops workload could shrink gradually.
Ops work falls into several kinds of situations. The following is how I classify the ones I encountered at work.
By degree of urgency
Not urgent: mostly errors such as a scheduled script timing out, which we usually choose to ignore. Some scripts work on the database and can be slow; that is fine as long as two such scripts do not run at the same time. These are the alerts we receive most often.
Moderately urgent: for example, some containers exceed their CPU or memory quota because of a sudden traffic spike. A good system would handle this automatically, but adjusting it by hand is also fine. In this case we simply adjust the threshold to silence the alarm.
Urgent: in some cases a machine hangs and appears to be dead. You then usually have to contact the SA to reboot it. We cannot do much about it ourselves, other than trying to reduce the number of crashes and improve stability. In my opinion a crash is not especially urgent, because Nginx stops routing to the crashed node and immediately switches to other nodes. Some services may be interrupted for a while and then recover.
Especially urgent: a problem with our platform itself, such as deployments suddenly failing, or the deployment program breaking on one of the machines. Although existing services are not affected, it bears directly on the stability of the service.
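The "not urgent" case only stays harmless while two database scripts never overlap. Here is a minimal sketch of one way to enforce that from the scheduler, assuming a POSIX shell. The names `run_exclusive`, the lock path and `db_report.sh` are hypothetical and not part of our platform:

```shell
# Run a command only if nobody else holds the lock directory.
# mkdir is atomic: it fails when the directory already exists,
# so only one caller at a time gets past it.
run_exclusive() {
  lock_dir=$1
  shift
  if ! mkdir "$lock_dir" 2>/dev/null; then
    echo "skipped: $lock_dir is held by another run" >&2
    return 75
  fi
  status=0
  "$@" || status=$?
  rmdir "$lock_dir"
  return "$status"
}

# Hypothetical usage inside a scheduled job:
#   run_exclusive /tmp/db-scripts.lock ./db_report.sh
```

If a job is killed midway, the lock directory is left behind and has to be removed by hand. On Linux, `flock -n lockfile command` from util-linux avoids that, because the kernel releases the lock when the process exits.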
By processing method
Alarms that can simply be silenced: this is done directly in the admin interface.
Temporary silencing, application redeployment and similar actions: these can also be done from the admin interface.
Problems that require connecting to the machine over SSH: for example, checking a machine after a crash and reboot, or fixing it when deployments fail.
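For the SSH case, a first look at a machine after a crash and reboot could go roughly like the sketch below. It is not our actual runbook: `check` and `post_reboot_check` are hypothetical helper names, and the `journalctl` line assumes a systemd host that keeps a persistent journal.

```shell
# Print a header, then run the command if it exists on this machine.
check() {
  echo "== $* =="
  if command -v "$1" >/dev/null 2>&1; then
    "$@" 2>&1 | tail -n 20 || true
  else
    echo "($1 not installed)"
  fi
}

post_reboot_check() {
  check uptime                              # time since boot and load average
  check last -x reboot shutdown             # recent reboot and shutdown records
  check journalctl -b -1 -p err --no-pager  # errors logged during the previous boot
  check df -h /                             # a full disk often blocks deployments
  check free -m                             # memory state after the restart
}
```

After logging in, paste the two functions into the session and run `post_reboot_check`. Each section is cut to its last 20 lines so the output stays readable on a small screen.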
By whom to deal with
Platform administrators: the admin should know about all of the problems described above.
Developers, that is, the users of the PaaS system: if an application's traffic is too heavy, the developer has to be notified to add machines to handle it.
SA: as mentioned above, crashes and other problems that the administrator cannot deal with.
DevOps pains
I am both the developer and the maintainer of the system, which is a simple form of DevOps. The development part is fine, since I only need to work on it during office hours. Maintenance, however, can take time out of my personal life, and that is basically unavoidable.
Office hours
I am not given fixed deadlines for my tasks, because I also carry the operations and maintenance work.
A problem can come up suddenly and take half a day or a whole day to deal with, so development progress is sometimes not so easy to control.
Outside office hours
There are no specific development tasks, but you still need to be on call, for example during Chinese New Year, taking turns with colleagues so that there is always someone available to solve problems.
Termux
Outside office hours I cannot always carry my laptop, and a problem that requires connecting to a machine becomes quite troublesome. I thought of two ways around it:
- Find a cyber cafe; it only needs a simple SSH environment set up.
- Install SSH on my smartphone; the screen is small, but it just works.
I think Termux is the best terminal solution on Android smartphones. I also tried JuiceSSH, but I could not even copy a secret key into it.
Termux’s main function is not actually SSH. It is a subsystem-like Linux environment that runs without root and keeps its packages under its own prefix. For me, it is just a way to install the SSH tools.
Install it
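A minimal sketch of the installation, assuming a fresh Termux install. `pkg` is Termux’s wrapper around apt, and `openssh` and `htop` are the package names in the Termux repository:

```shell
# Run inside the Termux app.
pkg update                # refresh the package index
pkg install openssh htop  # SSH tools and the process viewer
```

After that, `ssh`, `ssh-keygen` and `htop` are available in the Termux shell.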
htop
ssh connection
OpenSSH supports an ssh_config file; you can refer to My OpenSSH Series to configure the environment.
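As a rough illustration of what such a configuration can look like, the sketch below appends one host entry to `~/.ssh/config`. The alias `ops-node`, the address `203.0.113.10` (a documentation-only address), the user `ops` and the key path are all placeholders, not values from my setup:

```shell
# Create the SSH directory with the permissions OpenSSH expects.
mkdir -p "$HOME/.ssh"
chmod 700 "$HOME/.ssh"

# Append a host entry; every value below is a placeholder.
cat >> "$HOME/.ssh/config" <<'EOF'
Host ops-node
    HostName 203.0.113.10
    User ops
    Port 22
    IdentityFile ~/.ssh/id_ed25519
    ServerAliveInterval 30
EOF
chmod 600 "$HOME/.ssh/config"
```

With this in place, `ssh ops-node` is all that has to be typed on a phone keyboard, instead of the user, address and key path each time. If no key exists yet, `ssh-keygen -t ed25519` creates one.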
PS
I’m looking for a part-time job in Ops. I’m very familiar with Linux systems and Kubernetes.