How to self-host a highly available git server cluster?
Edit: it seems like my explanation turned out to be too confusing. In simple terms, my topology would look something like this:
I would have a reverse proxy hosted in front of multiple instances of git servers (let's take 5 for now). When a client performs an action, like pulling a repo/pushing to a repo, it would go through the reverse proxy and to one of the 5 instances. The changes would then be synced from that instance to the rest, achieving a highly available architecture.
Basically, I want a highly available git server. Is this possible?
I have been reading GitHub's blog on Spokes, their distributed system for Git. It's a great idea except I can't find where I can pull and self-host it from.
Any ideas on how I can run a distributed cluster of Git servers? I'd like to run it in 3+ VMs + a VPS in the cloud so if something dies I still have a git server running somewhere to pull from.
Before you can decide on how to do this, you're going to have to make a few choices:
Authentication and Access
There are two main ways to expose a git repo, HTTPS or SSH, and both have pros and cons here:
HTTPS
A standard sort of protocol to proxy, but you'll need to make sure you set up authentication on the proxy properly so that only those who should have access can get it. The git client will need to store a username and password to talk to the server, or you'll have to enter them on every request. gitweb is a CGI program that provides a basic, but useful, web interface.
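For reference, the nginx setup for this is roughly the example from the git-http-backend man page; the paths, the /git URL prefix, and the htpasswd file below are illustrative and assume fcgiwrap is installed:

```
# Serve repos under /srv/git over HTTPS with basic auth, via git-http-backend.
location ~ /git(/.*) {
    auth_basic           "git";
    auth_basic_user_file /etc/nginx/git.htpasswd;
    fastcgi_pass         unix:/run/fcgiwrap.socket;
    include              fastcgi_params;
    fastcgi_param        SCRIPT_FILENAME  /usr/lib/git-core/git-http-backend;
    fastcgi_param        GIT_PROJECT_ROOT /srv/git;
    fastcgi_param        GIT_HTTP_EXPORT_ALL "";
    fastcgi_param        PATH_INFO        $1;
}
```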
SSH
Simpler to set up, and authentication is a solved problem. Proxying it isn't hard: just forward the port to any of the backend servers, which avoids decrypting on the proxy. You will want to use the same host key on all the servers though, or SSH will refuse to connect. Doesn't require any special setup.
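If it helps, sharing a host key looks roughly like this (paths and server names are just examples; on most distros the real keys live in /etc/ssh/):

```shell
# Generate one host key and install the same key pair on every backend,
# so clients see a single SSH identity no matter which server answers.
mkdir -p /tmp/hostkey-demo
rm -f /tmp/hostkey-demo/ssh_host_ed25519_key*
ssh-keygen -t ed25519 -N '' -f /tmp/hostkey-demo/ssh_host_ed25519_key

# Copy the key pair to each backend and restart sshd there, e.g.:
#   scp /tmp/hostkey-demo/ssh_host_ed25519_key* root@server2:/etc/ssh/
#   ssh root@server2 systemctl restart sshd

# Print the fingerprint clients should see from every server:
ssh-keygen -lf /tmp/hostkey-demo/ssh_host_ed25519_key.pub
```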
Replication
Git is a distributed version control system, so you could replicate it at that level; alternatively you could use a replicated filesystem, or simple file-based replication. Each has its own trade-offs.
Git replication
Using git pull to replicate between repositories is probably going to be your most reliable option, as it's the job git was built for, and it doesn't rely on messing with git's underlying files directly. The one caveat is that if you push to different servers in quick succession you may cause a merge conflict, which would break your replication.

The cleanest way to deal with that is to have the load balancer send all requests to server1 if it's up, and only switch to the next server if all the prior ones are down. That way writes will all be going to the same place. Then set up replication in a loop, with server2 pulling from server1, server3 pulling from server2, and so on up to server1 pulling from server5. With frequent pulls, changes that are committed to server1 will quickly replicate to all the other servers.

This would effectively be a shared-nothing solution, as none of the servers share resources, which would make it easier to geographically separate them. The load balancer could be replaced by a CNAME record in DNS, with a daemon that updates it to point to the correct server.
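Sketched with local paths so you can try it standalone (on real servers the clone URL would be something like ssh://git@server1/srv/git/myrepo.git, and the repo names here are just examples):

```shell
# One-time setup on server2: take a mirror clone of server1's repo.
git init --bare /tmp/ha-git-demo/server1/myrepo.git
git clone --mirror /tmp/ha-git-demo/server1/myrepo.git /tmp/ha-git-demo/server2/myrepo.git

# The command server2's cron job would run every minute or so; a mirror
# clone fetches all refs, so new commits on server1 appear on server2.
git --git-dir=/tmp/ha-git-demo/server2/myrepo.git remote update --prune
```

server3 would run the same fetch against server2, and so on around the loop.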
Replicated filesystem
Git stores its data in a fairly simple file structure, so placing that on a replicated filesystem such as GlusterFS or Ceph would mean multiple servers could use the same data. From experience, this sort of thing is great when it's working, but can be fragile and break in unexpected ways. You don't want to be up at 2am trying to fix a file replication issue if you can avoid it.
File replication
This is similar to the git replication option, in that you have to be very aware of the risk of conflicts. A similar strategy would probably work, but I'm not sure it brings you any advantages.
I think my preferred solution would be to have SSH access to the git servers and to set up pull-based replication on a fairly fast schedule (where fast is relative to how frequently you push changes). You mention having a VPS as one of the servers, so you might want to push changes to that rather than have it be able to connect to your internal network.
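A post-receive hook is one way to do the push side; local paths here stand in for the real VPS URL, and "vps" is just an assumed remote name:

```shell
# Two bare repos standing in for server1 and the VPS.
git init --bare /tmp/hook-demo/origin.git
git init --bare /tmp/hook-demo/vps.git
git --git-dir=/tmp/hook-demo/origin.git remote add vps /tmp/hook-demo/vps.git

# After every accepted push, mirror all refs out to the VPS.
cat > /tmp/hook-demo/origin.git/hooks/post-receive <<'EOF'
#!/bin/sh
git push --mirror vps
EOF
chmod +x /tmp/hook-demo/origin.git/hooks/post-receive
```

In real use the vps remote would point at an SSH URL on the VPS, so only outbound connections from your network are needed.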
A useful property of git is that, if the server is missing changesets you can just push them again. So if a server goes down before your last push gets replicated, you can just push again once the system has switched to the new server. Once the first server comes back online it'll naturally get any changesets it's missing and effectively 'heal'.
This is a fantastic comment. Thank you so much for taking the time.
I wasn't planning to run a GUI for my git servers unless really required, so I'll probably use SSH. Thanks, yes that makes the reverse-proxy part a lot easier.
I think your idea of having a designated "master" (server1) with rolling updates to the rest of the servers is brilliant. The replication procedure becomes a lot easier this way, and it also removes the need for the reverse proxy! I can just use Keepalived, set up weights to make one of them the master and the rest slaves for failover. It also won't do round-robin, so no special handling for sticky sessions! This is great news from the networking perspective of this project.
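For reference, a minimal Keepalived sketch of that setup (the interface name, router id, and virtual IP are placeholders; lower the priority on each successive server):

```
# /etc/keepalived/keepalived.conf on the primary server
vrrp_instance GIT_VIP {
    state MASTER            # BACKUP on the other servers
    interface eth0
    virtual_router_id 51
    priority 200            # e.g. 150, 100, ... on the others
    advert_int 1
    virtual_ipaddress {
        192.168.1.50/24
    }
}
```

Git remotes then point at the virtual IP, and whichever server currently holds it takes all the traffic.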
Hmm, you said to enable pushing repos to the remote git repo instead of having it pull? I was going to create a WireGuard tunnel and have it accessible from my network for some stuff, but I guess that makes sense.
Thank you. I did think of this but I'm afraid this might lead me into a chicken and egg situation, since I plan to store my Kubernetes manifests in my git repo. But if the Kubernetes instances go down for whatever reason, I won't be able to access my git server anymore.
I edited the post which will hopefully clarify what I'm thinking about
I would run a standalone Forgejo server to act as your infrastructure server. Keep it separate from your production k8s/k3s environment.
If something knocks out your infrastructure Forgejo instance then your prod instance will continue to work. If something knocks out your prod, then your infrastructure instance is still there to pull on.
One of the reasons I suggest k8s/k3s is that if something happens, k8s/k3s will try to automatically bring the broken node back online.
Apologies for not explaining better. I want to run a load balancer in front of multiple instances of a git server. When my client performs an action like a pull or a push, it will go to one of the 5 instances, and the changes will then be synced to the rest.
I have edited the post to hopefully make my thoughts a bit more clear
I wonder if you could use HAProxy for that. It's usually used with web servers. This is a pretty surprising request though, since git is pretty fast. Do you have an actual real-world workload that needs such a setup? Otherwise, why not just have a normal setup with one server being mirrored, and a failover IP, which lots of VPS hosts can supply?
And, can you use round robin DNS instead of a load balancer?
Wouldn't it be better to have highly available storage for the git repo?
Something like Ceph, Minio, Seaweedfs, GarageFS etc.
Cause git is filesystem-based.
So, to be clear, GitHub is not git. Git is intrinsically distributed. GitHub is basically a repository Management service.
I did some googling for about 10 seconds and AFAIK GitHub does not support any type of self-hosting. I know you can self-host GitLab, but I don't see a project from either GitHub or GitLab called Spokes.
Not knowing any more than this about what you actually want to accomplish, my advice would be to just figure out how to run your own git server (without the management fluff) and do a 3-2-1 backup scheme. You could of course also create a GitLab instance with an HA setup, plus back that up to the cloud.
GitHub didn't publish the source code for their project, previously known as DGit (Distributed Git) and now known as Spokes. The only mention of it is in a blog post on their website, but I don't have the link handy right now.
Apologies for not explaining it properly. Essentially, I want to have multiple git servers (let's take 5 for now), have them automatically sync with each other, and run a load balancer in front. So when a client performs an action with a repository, it goes to one of the 5 instances and the changes are written to the rest.
I have edited the post, hopefully the explanation makes more sense now
Have you considered a distributed filesystem such as GlusterFS or DRBD? I believe those support synchronous replication, so writes will go to all the configured machines before the write is acknowledged. Performance will likely take a hit the greater the number of nodes in the cluster.