ConfigCheck to the rescue
- Adrian Penișoară
- Jun 1, 2015
- 3 min read
Updated: Mar 26, 2024
A Swiss-army knife to cut through the fog of servers managed by multiple teams.
How many times have you been in the situation where you were thinking:
“Yet another outage caused by another operational team's maintenance intervention with unforeseen side effects. How can we escape this hell?”
You're probably not the first person this has happened to. Around the world there are probably thousands of such incidents every day: small, seemingly harmless interventions by one team having terrible direct or indirect consequences for another, perhaps triggering yet another outage of your precious services in production.
How the story goes
For a long time we were haunted by these pesky troubles ourselves. In a perhaps familiar scenario, our team was in charge of an application service in production, but we were limited to mostly read-only access to the application's logs, with various other operational teams owning the rest of the layers (take your pick: Unix team, middleware team, network team and so forth).
As the usual story goes, here and there, out of the blue, we would see our mission-critical service misbehave or, in some of our worst nightmares, even go down. Only later, during RCA (Root Cause Analysis), would we find out that some other operational team had recently intervened on the same machine. Either the intervention had left it in a bad state for the application ("What, that was running on the server? Was I supposed to start that?") or, even more insidiously, some minute change had gone unnoticed for a long time, only to trigger an abysmal failure in another part of the system at the worst possible moment (hint: restarting a system is a prime candidate).
Yes, we've been there, felt the pain. :'(
Enter ConfigCheck
Since this tended to consume a lot of our team's energy (and morale!), something had to be done. As our hands were tied (no access to monitoring, no ability to install packages or binaries), we deployed ConfigCheck on the machines in question -- a small shell script tasked with detecting and tracking changes on the machine, simply launched from the application user's crontab on a periodic basis.
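As an illustration, such a deployment under an unprivileged application user could look like the crontab entry below; the path, log file and schedule are hypothetical, just to show the idea:

    # Run the change detector every 15 minutes from the application
    # user's home directory (paths and schedule are illustrative).
    0,15,30,45 * * * * $HOME/configcheck/configcheck.sh >> $HOME/configcheck/run.log 2>&1

(The explicit minute list is used instead of the */15 shorthand, which some classic System V crons do not understand.)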
Based on previously experienced pain points, we selected specific configuration files and the output of certain commands for which the tool would announce any detected changes, as well as retain a history of their state over time. Needless to say, nowadays we have a far better grasp of what's going on on the machine, sometimes even surprising the other "prime-time" operational teams with proactive tickets before they notice anything going wrong around them.
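To give a flavour of the approach, here is a minimal sketch of the core idea in plain Bourne shell. The directory layout, file names and tracked items are assumptions made for illustration, not ConfigCheck's actual internals:

    #!/bin/sh
    # Sketch: snapshot a set of files and command outputs, compare each
    # against the previous snapshot, announce differences and keep a
    # timestamped history of changed states. All names are illustrative.

    BASE=`dirname "$0"`         # everything lives next to the script
    SNAP="$BASE/snapshots"      # last known state of each tracked item
    HIST="$BASE/history"        # timestamped copies of changed items
    mkdir -p "$SNAP" "$HIST"

    STAMP=`date +%Y%m%d%H%M%S`
    TMP="$BASE/tmp.$$"

    # check NAME FILE -- compare FILE with the stored snapshot NAME,
    # report and archive on change, then refresh the snapshot
    check() {
        old="$SNAP/$1"
        if [ -f "$old" ]; then
            if cmp -s "$old" "$2"; then
                :               # unchanged, nothing to report
            else
                echo "CHANGE detected in $1 at $STAMP"
                diff "$old" "$2"
                cp "$2" "$HIST/$1.$STAMP"
            fi
        fi
        cp "$2" "$old"
    }

    # Tracked configuration files (illustrative selection)
    for f in /etc/hosts /etc/resolv.conf /etc/passwd; do
        cp "$f" "$TMP" 2>/dev/null && check `echo "$f" | tr / _` "$TMP"
    done

    # Tracked command output (illustrative selection)
    df -k > "$TMP" 2>&1 && check df-k "$TMP"

    rm -f "$TMP"

Run from cron, the diff output lands in the log (or gets mailed by cron itself), while the history directory lets you reconstruct afterwards exactly when a given item changed.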
Some of the key points for the tool are its portability and usability: since it's written in plain Bourne shell (the original /bin/sh, not the modern GNU bash replacement common on Linux), the tool can run practically anywhere, as this interpreter is virtually guaranteed to be present on any Unix or Unix-like machine. Furthermore, since it requires few dependencies and no new binaries, it can be deployed very easily and quickly in the home directory of a plain user on the machine: it does not expect any hardcoded filesystem paths (by default it loads its configuration files from the same location the script is run from).
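That last trick is essentially a one-liner in plain sh, the same dirname "$0" idiom used in the sketch above (the configcheck.conf file name is a hypothetical example):

    # Locate the directory the script was launched from and read the
    # list of tracked items from there -- no hardcoded paths needed.
    BASE=`dirname "$0"`
    . "$BASE/configcheck.conf"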
This is an epitome of the Unix philosophy: doing its job in a simple but very reliable way. Now we can only hope our human counterparts will evolve the same way... :)