
Outage because I made a spelling mistake | Reddthat: June Update

Today I made a new template for our Go-Away anti-bot protection and am upstreaming the solution so that all Lemmy admins can have a drop-in replacement.

When I was creating it I made it for us, and by default it looks like this: (The Reddthat one has our icon instead, of course.)

I named the template file reddthat-challenge.gohtml, and when I was making it more 'generic' I renamed it, replacing reddthat with lemmy...

Turns out I typed lemmmy instead! It's technically valid: ansible-lint and ansible-playbook --syntax-check both pass without issue! Our test PR deployment was also successful because, like all of the other items, everything was syntactically correct, and the one thing it doesn't do is issue a 'reload' or a dry-run. Even then, a dry-run only says the container will be recreated, which is the expected outcome: I made a change to the go-away proxy, so I expect it to restart...

 
    
--- a/templates/docker-compose.yml
+++ b/templates/docker-compose.yml
@@ -170,7 +170,7 @@ services:
     volumes:
       - "goaway_cache_main:/cache"
       - "./policy.yml:/policy.yml:ro"
-      - "./lemmmy-challenge.gohtml:/lemmy-challenge.gohtml:ro"
+      - "./lemmy-challenge.gohtml:/lemmy-challenge.gohtml:ro"
     environment:
       GOAWAY_CLIENT_IP_HEADER: "X-Real-Ip"
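
A check like the following could catch this class of typo before deploy. It is a minimal sketch, not what Reddthat actually runs: it assumes the rendered compose file uses the short "source:target" volume syntax with relative host paths, and the names missing_bind_sources and demo are hypothetical. It scans for bind mounts and reports any host file that does not exist next to the compose file.

```python
import re
import tempfile
from pathlib import Path

# Matches short-syntax list items such as: - "./policy.yml:/policy.yml:ro"
# Only host paths starting with ./ or ../ are considered; named volumes
# such as goaway_cache_main are skipped.
BIND_MOUNT = re.compile(r'^\s*-\s*["\']?(\.{1,2}/[^:"\']+):')


def missing_bind_sources(compose_text: str, base_dir: Path) -> list[str]:
    """Return relative bind-mount sources that do not exist under base_dir."""
    missing = []
    for line in compose_text.splitlines():
        match = BIND_MOUNT.match(line)
        if match and not (base_dir / match.group(1)).exists():
            missing.append(match.group(1))
    return missing


def demo() -> list[str]:
    """Recreate the typo from the diff above inside a temporary directory."""
    compose = '''
    volumes:
      - "goaway_cache_main:/cache"
      - "./policy.yml:/policy.yml:ro"
      - "./lemmmy-challenge.gohtml:/lemmy-challenge.gohtml:ro"
    '''
    with tempfile.TemporaryDirectory() as tmp:
        base = Path(tmp)
        # Files that actually exist next to the compose file in this demo.
        (base / "policy.yml").write_text("")
        (base / "lemmy-challenge.gohtml").write_text("")
        return missing_bind_sources(compose, base)


if __name__ == "__main__":
    print(demo())
```

Run in CI against the rendered template directory, a non-empty result would fail the PR even though the YAML itself is perfectly valid.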


  

Our monitoring checks were all returning success?!?

This is a screenshot from https://status.reddthat.com/:


(This service is Betterstack)

It was set up to only alert when "the URL becomes unavailable" (which is any code outside 2XX). Did Cloudflare previously pass through the error code? I ask because this has always alerted me whenever the backend failed.

Here is a memory graph. You can see the red 'swap' stay basically static until I tested the template, at which point the swap came back up again. Then at 10:53 (00:53 UTC) it was restarted and never came back!


(This service is hetrixtools)

And here is my own on-site monitoring service showing all green!


(This is uptime-kuma. Note: this goes directly to the box and not via Cloudflare.)

I can probably dig deeper into the logs, but as I went out to lunch and my commit at 10:52 coincides exactly with the 'increase'/lack of increase seen above, we can be certain it's related!

A 3 hour outage and I didn't know about it because all 3 uptime monitoring solutions agreed we were online!

Resolution:

I'm so sorry!

Turns out uptime-kuma, which is my fallback for the online services, was set up to monitor only the PORT directly on the box. So as our webserver was still running, as far as it was concerned we were technically "online".
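To make the gap concrete, here is a small self-contained sketch of why a port-only check stays green while the site is broken. The 502 response is an assumption for illustration (the post does not say what the webserver actually returned during the outage), and BrokenBackend, port_is_open and http_status are hypothetical names. A local server answers every request with an error, yet the TCP connect still succeeds.

```python
import http.client
import socket
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class BrokenBackend(BaseHTTPRequestHandler):
    """Stand-in for a web server whose upstream app is down (assumed 502)."""

    def do_GET(self):
        body = b"Bad Gateway"
        self.send_response(502)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def port_is_open(host: str, port: int) -> bool:
    """What a port-only monitor checks: can a TCP connection be opened?"""
    try:
        with socket.create_connection((host, port), timeout=2):
            return True
    except OSError:
        return False


def http_status(host: str, port: int) -> int:
    """What an HTTP monitor checks: the status code of a real request."""
    conn = http.client.HTTPConnection(host, port, timeout=2)
    try:
        conn.request("GET", "/")
        return conn.getresponse().status
    finally:
        conn.close()


def demo() -> tuple[bool, int]:
    server = HTTPServer(("127.0.0.1", 0), BrokenBackend)  # port 0 = any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        port = server.server_address[1]
        return port_is_open("127.0.0.1", port), http_status("127.0.0.1", port)
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(demo())
```

The port check reports the service as up while the HTTP request comes back with an error status, which is the same disagreement that hid this outage.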

  • I've added a few more checks to uptime-kuma so instead of just being a port check, there is an HTTPS check with a keyword. So the page has to load, and the keyword needs to be on the page, otherwise I'll get notified.
  • I've updated Betterstack to accept only a 200 HTTP code, which is what Hetrixtools was already set to.
  • I've updated Hetrixtools to also check for "Reddthat" in the page.
  • Hetrixtools can also monitor the docker service, so if the docker service goes down I get a message now.
  • I'm thinking about adding a way to punch into the box from uptime-kuma too and monitor the docker container itself, as Hetrixtools doesn't seem to do that, but that might be overkill? Or not? This will probably be the only way in which I can tell?
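
The status-plus-keyword rule from the list above can be sketched in a few lines. This is an illustration of the idea only, not how uptime-kuma, Betterstack or Hetrixtools implement it; page_is_healthy and check_url are hypothetical names, and the keyword "Reddthat" is the one mentioned above.

```python
import urllib.request


def page_is_healthy(status: int, body: str, keyword: str) -> bool:
    """Healthy only if the status is exactly 200 and the keyword is on the page."""
    return status == 200 and keyword in body


def check_url(url: str, keyword: str, timeout: float = 10.0) -> bool:
    """Fetch url and apply page_is_healthy; network and HTTP errors count as down."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
            return page_is_healthy(response.status, body, keyword)
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False


if __name__ == "__main__":
    # A page that loads but lacks the keyword is still treated as down.
    print(page_is_healthy(200, "<title>Reddthat</title>", "Reddthat"))
    print(page_is_healthy(200, "<h1>Something went wrong</h1>", "Reddthat"))
```

Requiring both conditions means an error page served with a 200, or the right content behind a non-200 status, both trigger a notification.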

Even after 15 years (oof now I feel old) doing "IT" I am still amazed at how, no matter how many systems you put in place to try and catch issues, assumptions and unfortunate circumstances can rear their ugly heads.

Our automated testing all passed.
Our external monitoring solutions passed.
Our internal monitoring solutions passed.
Our syntax checkers all passed.
Our service still went down.

Cheers,

Happy May!

Tiff


Also happy June? What was I thinking...

The above is probably enough of an update. Things are mostly normal. Lemmy devs are nearing the 1.0 release, which will probably mean a big push across the internet and more people joining. (Hopefully picking Reddthat :P ).

I've been struggling to accept all the new people in a timely manner. As I like to sleep for at least 6 hours and don't always wake up and immediately open Reddthat (shocking, I know!), sometimes people end up waiting around 12 hours for their applications to be accepted. I can see that being a problem, especially when they don't supply an email address. I'm working on a better solution to alert me when there are applications that need processing.

Also did you know I renewed the domain again? That means it's been 3 Years now!

Honestly insane. This Lemmy thing is probably the biggest thing I've done for a community on the internet for a long time! Here's to 3 years more!

<3

Tiff

Donation Thingy:

Note: On Liberapay, donations are paid in advance, but you are more than welcome to make it recurring monthly instead of paying yearly. Don’t worry too much about the “fees”. It’s just the cost of doing business via the credit card duopoly.

💸 “Expenses”:

  • April Costs: ~A$150
  • May Costs: ~A$150

Still tracking around the ~A$150 mark per month. I managed to cull some assets from our S3 bucket (i.e. our old backups and all the dev buckets) to bring our costs down a little bit, so when the June bill comes in within the next 3 days I'll update it.

⭐ Donation “Statistics”:

  • New Donators in May: 0
  • Lost Donators (Who did not renew): 0
  • Total Weekly: ~A$26.84 (Trending Down)
  • (“Monthly”: 26.84×52÷12 = ~A$116.31)
  • Our Public Donators: <3
  • AppleStrudel
  • asqapro
  • bitwize
  • ~1903711
  • Matthew Fennell

🥅 Goal: 26.84 / 60.00

Want a month dedicated to you? -> https://liberapay.com/reddthat

PS: don’t like fees? Use Crypto (Litecoin/Monero) for even better transaction fees than credit cards for your donation. (See the main sidebar for addresses). And validate them again on liberapay too if you want to ensure I get those dollary doos!
