So today I discovered that a cron job holding non-reproducible state had died, and now our system is fucked.
The cron job isn't in any source control. This morning it entered a terminal state, and because it overwrites its own state there's no way to revert it.
I'm currently waiting for the database rollback, and in the meantime I've rewritten the job in a reproducible/idempotent way.
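For anyone wondering what "reproducible/idempotent" can look like in practice, here's a minimal sketch, not the actual job. It assumes the job can be keyed by a run ID (a date, say) and that its result fits in a JSON file; run_job, state_dir and compute are made-up names for illustration. Each run writes a new file instead of overwriting the old state, and re-running the same ID changes nothing.

```python
import json
import os
import tempfile
from pathlib import Path


def run_job(state_dir: Path, run_id: str, compute) -> dict:
    """Run `compute` at most once per run_id and keep every result on disk."""
    state_dir.mkdir(parents=True, exist_ok=True)
    target = state_dir / f"{run_id}.json"

    if target.exists():
        # This run already happened: return the stored result, touch nothing.
        return json.loads(target.read_text())

    result = compute(run_id)

    # Write to a temp file first, then rename it into place, so a crash
    # mid-write never leaves a half-written state file behind.
    fd, tmp_path = tempfile.mkstemp(dir=state_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as fh:
        json.dump(result, fh)
    os.replace(tmp_path, target)
    return result
```

Because earlier run files are never overwritten, "reverting" is just a matter of reading an older file.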
This is almost exactly what happened to me on Monday, resulting in a fifteen-hour day.
My particular Jenga piece was an Access query that none of my predecessors had deigned to document or even tell me about... but it had to be run monthly, or you ended up with obsolete data embedded deep within multimillion-dollar reports.
Thank god I don't work on salary anymore, or I'd be really upset.
Idempotent code/repositories are great - I love making everything as reproducible as possible, particularly in make, where every 'all' type target should have a corresponding 'clean' target. Many times I'll see codebases that skip defining 'clean'... or worse, have no 'all' to begin with and rely on the developer knowing all the build and environment setup commands...
Yeah, I don’t consider most code complete unless it’s safe and reproducible. I love make; I'm currently using npm, but you can set up scripts with that too. Automating the build process was the very first thing I did.
This project is a piece of work. There’s effectively no documentation, and every now and then I find something new like this. The stuff I’ve fixed up so far has been much more reliable and performant.
Part of me just wants to rewrite the whole thing, but I need to ship features so we can sell the product and pay my salary.
At least I’m not a cog in a huge corporation getting my soul crushed every day. I actually love fixing weird stuff.
Turns out there was a second bug that triggered this one, and the bug I found in this script, which I thought was responsible, had been happening silently for months.
Had a similar thing once. Somehow, some way, the DBA copied and pasted something wrong. Oracle DB had some odd extra syntax for left and right joins that other DBs didn't (or at least that I'd never seen). My best guess is that he auto-formatted out of habit and maybe it took those symbols out.
It took a long time to find that, because the only evidence something was wrong was that ONE of our customers wasn't being billed for ONE product. Everyone else was fine. Basically they were using it in a very atypical way. The left joins made sure to include them in the billing even though they didn't have whatever was on the right of that join. Everyone else did.
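To show why losing the outer join only bit that one customer, here's a small sketch using Python's built-in sqlite3 rather than Oracle. The tables and names are invented for the demo, not the real billing schema. (I'm guessing the "odd extra syntax" was Oracle's old (+) outer-join operator, but that's an assumption on my part.)

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE addons (customer_id INTEGER, addon TEXT);
INSERT INTO customers VALUES (1, 'typical'), (2, 'atypical');
-- Only the typical customer has a row on the right side of the join.
INSERT INTO addons VALUES (1, 'support');
""")

# LEFT JOIN keeps every customer, matched or not.
billed_left = [row[0] for row in conn.execute(
    "SELECT c.name FROM customers c "
    "LEFT JOIN addons a ON a.customer_id = c.id ORDER BY c.id"
)]

# A plain (inner) JOIN silently drops customers with no match.
billed_inner = [row[0] for row in conn.execute(
    "SELECT c.name FROM customers c "
    "JOIN addons a ON a.customer_id = c.id ORDER BY c.id"
)]

print(billed_left)   # ['typical', 'atypical']
print(billed_inner)  # ['typical']
```

No error, no warning, just one missing row, which is why it takes so long to notice.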
We never had our crons in source control, but I always saved a copy somewhere (usually on my machine and the target machine) so we had some history just in case someone typed crontab -r (remove) instead of crontab -l (list) for some reason. You can also create an alias called backupCrontab or something that runs the command for you and puts the output somewhere safe.
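If you'd rather script it than alias it, a rough Python version of the same idea might look like this. backup_crontab and the backup directory are hypothetical names; the only real command involved is crontab -l, which prints the current user's crontab.

```python
import subprocess
from datetime import datetime
from pathlib import Path


def backup_crontab(dest_dir: Path, command=("crontab", "-l")) -> Path:
    """Save the output of `crontab -l` to a timestamped file and return its path."""
    # check=True raises CalledProcessError if the command fails,
    # for example when the user has no crontab installed.
    result = subprocess.run(
        list(command), capture_output=True, text=True, check=True
    )
    dest_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%dT%H%M%S")
    target = dest_dir / f"crontab-{stamp}.bak"
    target.write_text(result.stdout)
    return target
```

Something like backup_crontab(Path.home() / "crontab-backups") before every edit gives you the history. The command parameter is only there so the function can be tried out without touching a real crontab.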