More troubleshooting was done today. Here's what we did:
Yesterday evening @phiresky@lemmy.world did some SQL troubleshooting with some of the lemmy.world admins. After that, phiresky submitted some PRs to GitHub.
We started using a Lemmy Docker image built with those PRs, and saw a big drop in CPU usage and disk load.
We saw thousands of errors per minute in the nginx log from old clients trying to access the websockets (which were removed in 0.18), so we added a `return 404` in the nginx config for /api/v3/ws.
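For the curious, a minimal sketch of what that change could look like (the surrounding server block is assumed; this is not the actual lemmy.world config):

```nginx
# Old 0.17-era clients still poll the websocket endpoint that 0.18 removed;
# answer them at the edge instead of passing every request to Lemmy.
location /api/v3/ws {
    return 404;
}
```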
We updated lemmy-ui from RC7 to RC10, which fixed a lot, including the issue with replying to DMs.
We found that the many 502 errors were caused by an issue in Lemmy/markdown-it.actix or whatever, causing nginx to temporarily mark an upstream as dead. As a workaround we can either (1) use only one container, or (2) set proxy_next_upstream timeout; max_fails=5 in nginx.
Currently we're running with one Lemmy container, so the 502 errors are completely gone so far, and thanks to the fixes in the Lemmy code everything seems to be running smoothly. If needed, we could spin up a second Lemmy container using the proxy_next_upstream timeout; max_fails=5 workaround, but for now it seems to hold with one.
And thank you all for your patience, we'll keep working on it!
Oh, and as a bonus, an image (thanks Phiresky!) of the change in bandwidth after implementing the new Lemmy Docker image with the PRs.
Edit: So as soon as the US folks wake up (hi!) we seem to need the second Lemmy container for performance after all. That's now started, and I noticed the proxy_next_upstream timeout setting didn't work (or I didn't set it properly), so I used max_fails=5 for each upstream instead, which does actually work.
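For anyone wanting to replicate this, a rough sketch of the upstream block as described (container names and fail_timeout are illustrative assumptions; 8536 is Lemmy's default port):

```nginx
upstream lemmy {
    # max_fails=5: only mark a container as down after 5 failed attempts.
    # The nginx default is max_fails=1, so a single hiccup from the
    # markdown-it/actix bug would otherwise take the whole upstream out.
    server lemmy-1:8536 max_fails=5 fail_timeout=10s;
    server lemmy-2:8536 max_fails=5 fail_timeout=10s;
}
```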
This is why having a big popular instance isn't all bad. It helps detect and fix the scaling problems and inefficiencies for all the other 1000s of instances out there!
You guys had better quit it with all this amazing transparency or it's going to completely ruin every other service for me. Seriously though amazing work and amazing communication.
Also, to other people: DONATE TO FOSS PROJECTS. If 50,000 people donate just €0.50 each, that's €25,000 for funding servers, coding, motivating people, etc.
Just skip one cup of coffee for a day. There are already 2 million of us across Lemmy instances. We can build a decentralized world together!!
Boy, does it feel good to have these reports and understand the work you guys do. It's really inspiring! Thanks for your hard work; everything has been silky smooth! This instance is really great, Lemmy and its devs are really amazing, and I feel at home in a nice, cozy community.
This is why I love open source. The fact that a community can directly debug the code it's being hosted on and directly contribute improvements back is just wild. Thanks for all the hard work @ruud@lemmy.world and the rest of the lemmy.world team! The site already feels much more responsive.
Literally a night and day difference in performance and stability! Thank you all for the hard work. To other users like me, consider reducing or replacing one of your lesser used subscriptions and directing that money to Lemmy. It’s much better served here if you ask me.
Good to see a heavy production server taking on the scaling issues. Thank you! To discuss Lemmy performance issues, there is a dedicated community: !lemmyperformance@lemmy.ml
Browsing feels pretty good now, and it makes the experience of using Lemmy much more enjoyable. Having to spam the vote buttons was really annoying.
Even though I'm not from this instance, this is such a nice way of keeping users posted about changes.
I wish more companies (I know this is not a company) went straight to the point, instead of using vague terms like "improved stability, fixed few issues with an update" when things are changed. I hope all instance owners follow this trend.
Can we have an update on the status of lemmy.world and how close our ties are going to be with Meta's Threads? Threads is going to support ActivityPub, but time has shown this to be an attempt to kill an open platform and eventually replace it with their own once they get everyone in their ecosystem (Embrace, Extend, Extinguish). Mastodon has said today that they don't mind sleeping with vipers, even though their demise/dissolution is in Meta's best interest.
Please tell me we are defederating from Meta....or let us know what to expect
EDIT: I originally stated that Mastodon told them to fuck off, but I got confused with Fosstodon (who did that). Mastodon doesn't mind being in bed with Meta
How great is it to be a part of history in the making -
This is Web 3 in its fomenting -
Headlines ~5yrs:
The ending of Web 2 was unceremonious and just ugly. u/spez and moron@musk watched as their social media networks signaled the end of Web 2 and slowly dissolved. The blue bird's value disintegrated and Reddit's hopes for an IPO did likewise. Twitter and Reddit dissolved into odorous flatulence as centralization fell apart, to the world's benefit. Decentralized/federated social media such as Mastodon and Lemmy made their convoluted progress and led Web 3's development and growth…
This is how history is made, it’s ugly and convoluted but comes out sweeet…
Lemmy's devs and the .world admins have done in a month what Reddit hasn't done in its whole existence: deliver a smooth and almost bug-free experience.
I'm very curious: does a single Lemmy instance have the ability to scale horizontally across multiple machines? You can only get so big of a machine. You did mention a second container, which would suggest the Lemmy software can do so, but I'm curious if I'm reading that right.
Shouldn't the correct HTTP status code for a removed API be 410? 404 just indicates the resource wasn't found or doesn't exist; 410 indicates a resource that has been deliberately removed.
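In nginx terms that would be a one-line change to the block mentioned in the post, something like this sketch (same caveats as any config excerpt here):

```nginx
# 410 Gone: the websocket API was deliberately removed in 0.18,
# not just temporarily missing.
location /api/v3/ws {
    return 410;
}
```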
Awesome work - things seem to be running much more smoothly today.
Do you have anything behind a CDN by chance? Looking at the lemmy.world IPs, the server appears to be hosted in Europe and web traffic goes directly there: the IPv4 address seems to resolve to a Finland-based host, and the IPv6 address to a Germany-based one.
If you put the site behind a CDN, it should significantly reduce your bandwidth requirements and greatly drop the number of requests that need to hit the origin server. CDNs would also make content load faster for people in other parts of the world. I'm in New Zealand, for example, and I'm seeing 300-350 ms latency to lemmy.world currently. If static content such as images could be served via CDN, that would make for a much snappier browsing experience.
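For what it's worth, a CDN can only cache what the origin marks as cacheable, so step one would be long-lived cache headers on static assets. A sketch, assuming the standard Lemmy setup where images are served by pict-rs under /pictrs (the path, port, and max-age here are assumptions, not lemmy.world's actual layout):

```nginx
# Images are effectively immutable once uploaded, so let a CDN
# (and browsers) cache them for a day without hitting the origin.
location /pictrs/ {
    proxy_pass http://pictrs:8080;
    add_header Cache-Control "public, max-age=86400";
}
```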
Whilst I'm aware that too many users on one instance can be a bad thing for the wider Fediverse, I think it is a great thing at the moment in terms of how well people are banding together to fix the issues being encountered from such a surge in users.
The issues being found on lemmy.world result in better Lemmy instances for everyone and improve the whole Fediverse of Lemmy instances.
I'm very impressed with how well things are being debugged under pressure, well done to all those involved 👏
I'd volunteer to be a technical troubleshooter (very familiar with docker/javascript/SQL, not super familiar with Rust), but I'm sure y'all also have an abundance of nerds to lend a hand.
I just love the transparency you guys are coming forward with. It's absolutely awesome! Thank you for that and for all the work you put in. It means a lot to me that you folks are taking the time to keep us updated. Much love!
It blows my mind that with the amount of traffic you guys must be getting, you're only running one container rather than a k8s cluster with multiple pods (or a similar container orchestration system).
Edit: misread, a second one is coming up. Still crazy that this doesn't take some multi-node cluster with multiple pods. Fucking awesome.
You know, having dealt with the lagginess over the past few days makes me appreciate how fast and responsive things are after the update. It's nice to see the community grow, and it makes the experience of using Lemmy feel authentic.
Really great job, guys! I know from my experience in SRE that this kind of debugging, monitoring and fixing can be a lot of pain, so you have all my appreciation. I'm even determined to donate on Patreon if it's available.
It felt like I’d jinx us all if I commented but THANK YOU! This has been a wonderful experience today. Absolutely loving it and knew you just needed some time to work out the kinks that happen with fast growth.
I like that the post goes into detail and lets us tech nerds get hard watching this stuff, instead of the regular corpo mumbo-jumbo changelog that consists of "bug fixes and performance improvements".
I hope to start on some small contributions sometime next week. Stability has been noticeably better the last few days and I imagine it’s only going to get better.
A lot of this stuff is pretty opaque to me, but I still read through it because I love how detailed they are in sharing what's going on under the hood and how it relates to problems users experience. Kudos to you guys!
Would HAProxy work better as a load balancer? For work we switched due to some issues with NGINX; so far, the service has been much more consistent with pretty much no downtime, even when restarting server hosts.
Compared to days prior, things are running much better today: page load speed, reliable posting/replying/upvoting. I dunno if it's just happenstance, but whatever knobs, levers and keystrokes you're manipulating, keep doing the things! Thanks so much for a home absent of corporate BS.
Man, I thought I noticed something different. For the past week or so I've gotten nothing but network errors and Java errors in Jerboa, which are completely gone now. Posts load almost instantly too. Appreciate the effort, guys; I was going insane.
I'll be honest I don't know what any of this means but what I can say is I absolutely love the transparency of all of this. It's so refreshing and maybe I'll start learning more about what I'm looking at because I'll keep seeing it. Great work!
Thank you so much for the hard work, time and money you put into making lemmy.world run so smoothly. This much transparency is awesome for something that's being used so massively.
The instance seems to be much better. Posting and commenting is not taking as long and loading times are way better. I hope things can stay this good or even get better.
Thank you so much! I will be donating a few cappuccinos your way when my next check arrives. I really appreciate how awesome of a community you’ve brought together & all of the transparency with the updates (and the frequency) is astounding! Keep up the great work but don’t forget to take breaks :)
Have you looked into possibly migrating to kubernetes or some other form of docker container management/orchestration system to help with automatic scaling and load balancing?
Everything is feeling great so far. The only bug I'm encountering is that when opening a thread (in Firefox on desktop) it auto-scrolls down past the content to the replies.
Please recommend people to update their app in a topic title. Connect couldn't even load a topic without failing out today. An update fixed it, but I had to manually force it because it didn't apply automatically.
This will drive people away. Literally none of the communities I subscribe to on world even seemed to have a new comment.
Minor thing, but overnight both the wefwef and Memmy clients started showing the wrong comment score (karma) on my profile, and given they show the same number I assume it's related to the data fed by the API. The value was correct yesterday. Easy for me to confirm given I have only two dozen posts and the value has dropped to single digits.
Not a biggie, but figured I'd report it in case there's some issue causing it. Might be that some optimisation around indexing has intentionally or unintentionally impacted it.
Otherwise the service feels much more stable currently. No timeouts today where it’s been very frequent the past few days. Nice job. 👍
damn bro, y’all coming in clutch to improve stability of this lemmy instance.
Good shit, bros. Hope to contribute upstream and find more performance-related bugs. I browsed the code for Lemmy and could not find any performance tests.
Really appreciate all the hard work going on behind the scenes! Feels night and day different after the changes. Also appreciate the transparency. Nice to see in this day and age.
I can't imagine the amount of work Lemmy's devs and admins have been under these past few days, but it's important for the future of Lemmy. Keep up the good work! You're awesome!
The 502 errors still seem to be common for me, but they're less persistent. Before, they lasted through multiple refreshes; now it's safe to say they're usually gone after one reload.