I'm hosting my own Lemmy instance and trying to figure out how to optimize PSQL to reduce disk IO at the expense of memory.
I accept increased risk this introduces, but need to figure out parameters that will allow a server with a ton of RAM and reliable power to operate without constantly sitting with 20% iowait.
Current settings:
# DB Version: 15
# OS Type: linux
# DB Type: web
# Total Memory (RAM): 32 GB
# CPUs num: 8
# Data Storage: hdd
max_connections = 200
shared_buffers = 8GB
effective_cache_size = 24GB
maintenance_work_mem = 2GB
checkpoint_completion_target = 0.9
wal_buffers = 16MB
default_statistics_target = 100
random_page_cost = 4
effective_io_concurrency = 2
work_mem = 10485kB
min_wal_size = 1GB
max_wal_size = 4GB
max_worker_processes = 8
max_parallel_workers_per_gather = 4
max_parallel_workers = 8
max_parallel_maintenance_workers = 4
fsync = off
synchronous_commit = off
wal_writer_delay = 800
wal_buffers = 64MB
Most load comes from LCS script seeding content and not actual users.
Solution:
My issue turned out to be really banal - Lemmy's PostgreSQL container was pointing at default location for config file (/var/lib/postgresql/data/postgresql.conf) and not at the location where I actually mounted custom config file for the server (/etc/postgresql.conf). Everything is working as expected after I updated docker-compose.yaml file to point PostgreSQL to correct config file. Thanks @bahmanm@lemmy.ml for pointing me in the right direction!
A few things off the top of my head in order of importance:
How frequently do you VACCUM the database? Have you tried VACCUMing a few of times over a 5 min span & see if there are changes to the disk I/O aftewards?
I've got no idea how Lemmy works but "seeding content", to my mind, possibly means a lot of INSERT/UPDATEs. Is that correct? If yes, there's a chance you may be thrashing your indices & invalidating them too frequently which triggers a lot of rebuilding which could swallow a very large portion of the shared_buffers. To rule that out, you can simply bump shared_buffers (eg 16GB) & effective_cache_size and see if it makes any difference.
Please include a bit more information about PG activity, namely from pg_stat_activity, pg_stat_bgwriter & pg_stat_wal.
You've got quite a high value for max_connections - I don't believe that' s the culprit here.
And finally, if possible, I'd highly recommend that you take a few minutes & install Prometheus, Prometheus node exporter, Proemetheus PG exporter and Grafana to monitor the state of your deployment. It's way easier to find correlations between data points using the said toolset.
I never manually VACUUMed the DB. I just assumed it does it automatically at regular intervals. VACUUMing manually didn't seem to make any difference and gave me the following error after a few minutes of running on various tables:
ERROR: could not resize shared memory segment "/PostgreSQL.1987530338" to 67128672 bytes: No space left on device
I'm not 100% sure where it out of space, but I'm assuming one of the configured buffers since there was still plenty of space left on disk and RAM.
I didn't notice any difference in iowait while it was running or after.
Yes, seeding is mostly inserts, but I see a roughly equal number of selects. I did increase shared_buffers and effective_cache_size with no effect.
I did install Prometheus with PG exporter and Grafana. I'm not a DB expert and certainly not a PostgreSQL expert, but I don't see anything that would indicate an issue. Anything specific you can suggest that I should focus on?
That means too many chunky parallel maintenance workers are using the memory at the same time (max_parallel_maintenance_workers and maintenance_work_mem.)
VACCUMing is a very important part of how PG works; can you try setting max_parallel_maintenance_workers to 1 or even 0 (disable parallel altogether) and retry the experiment?
I did increase shared_buffers and effective_cache_size with no effect.
That probably rules out the theory of thrashed indices.
Since those stats are cumulative, it's hard to tell anything w/o knowing when was the SELECT run. It'd be very helpful if you could run those queries a few times w/ 1min interval and share the output.
I did install Prometheus with PG exporter and Grafana...Anything specific you can suggest that I should focus on?
I'd start w/ the 3 tables I mentioned in the previous point and try to find anomalies esp under different workloads. The rest, I'm afraid, is going to be a bit of an investigation and detective work.
If you like, you can give me access to the Grafana dashboard so I can take a look and we can take it from there. It's going to be totally free of charge of course as I am quite interested in your problem: it's both a challenge for me and helping a fellow Lemmy user. The only thing I ask is that we report back the results and solution here so that others can benefit from the work.
Your Prom PG Exporter is a 404 (I think there is a trailing t in the URL).
Do you have any recommendations for dashboard for grafana/Pg?
As well as statistics that are important?
This is something I'm going to be putting into my deployment, and it's really easy to get overwhelmed with data!
The stock Grafana dashboard for PG is a good starting point. At least, that's how I started. You really should add new metrics to your dashboard if you really need them as you said.
Don't forget to install node-exporter too. It gives some important bits of info about the PG host. Again the stock dashboard is a decent one to start w/.
Interestingly it provides the same config for a 32gb/8core/HDD setup, except for the last 4 lines.
fsync = off
synchronous_commit = off
wal_writer_delay = 800
wal_buffers = 64MB
From my understanding of PG, fsync and synchronous commit are both likely to reduce data integrity by speeding up writes. It will allow PG to send file/page changes into the void of the OS without waiting for confirmation that it's happened successfully.
The wal options there, I do not know about.
wal_writer_delay (default 200) 800
After flushing WAL the writer sleeps for the length of time given by wal_writer_delay, unless woken up sooner by an asynchronously committing transaction.
I guess this gives the OS more time to deal with other PG operations before writing any new data to disk.
wal_buffers (default -1 auto) 64mb
The amount of shared memory used for WAL data that has not yet been written to disk. The default setting of -1 selects a size equal to 1/32nd (about 3%) of shared_buffers
The default auto setting would have assigned 250mb for wal_buffers. Not sure why this is explicitly set, or what the benefit of reducing it is
I think I was hoping there's s magic setting that would allow psql to operate more like Redis that uses ram for everything until it dumps it to disk at specific intervals.
I wouldn't take anything I say as a recommendation. I'm learning, too. And was hoping to start a conversation (or get corrected).
I should've referenced the actual docs. Google directed me to some 3rd party bullshit.
The amount of shared memory used for WAL data that has not yet been written to disk. The default setting of -1 selects a size equal to 1/32nd (about 3%) of shared_buffers, but not less than 64kB nor more than the size of one WAL segment, typically 16MB. This value can be set manually if the automatic choice is too large or too small, but any positive value less than 32kB will be treated as 32kB. If this value is specified without units, it is taken as WAL blocks, that is XLOG_BLCKSZ bytes, typically 8kB. This parameter can only be set at server start.
The contents of the WAL buffers are written out to disk at every transaction commit, so extremely large values are unlikely to provide a significant benefit. However, setting this value to at least a few megabytes can improve write performance on a busy server where many clients are committing at once. The auto-tuning selected by the default setting of -1 should give reasonable results in most cases.
So, it's more about concurrent client writes... I guess?
For this purpose I would search for Linux specific thing - a RAM based storage that receives data writes and even flushes/fsycs and then lazily writes them to HDD/SSD. There would be problem in case of power loss, but the gained performance... Unfortunately, I don't know any such tool.
I guess I could just use rsync to periodically sync RAM drive to disk or just rely on backups to restore running state in case of failure or just a restart. On a mostly idle server with few users this could probably work, but I don't think I'm quite ready for such a risky setup. Server is still perfectly usable at 15% iowait - i was just hoping I could reduce it with mechanisms built into PSQL. Appreciate the suggestion though.
Edit:
I just want to say this would be an awesome feature for Docker/Podman to have an in-memory volume that syncs to disk either periodically or on container termination as an option.