I wonder what kind of computing resources that Microsoft service needs. Isn't it essentially just a set of hashes? My point being that centralization does not have to be an issue.
It's a bit of an unknown, since the service is a proprietary black box. With that being said, my guess:
A database with perceptual hash data for volumes and volumes of CSAM
Means to generate new hashes from media
Infrastructure for adding and auditing more of it
A REST API for hash comparisons and reporting
Integration for pushing reports to NCMEC and law enforcement
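Of those pieces, the hash-comparison API is probably the simplest conceptually. As a rough illustration (the 64-bit hash format and the match threshold below are invented, since the real PhotoDNA internals are proprietary), the core lookup could be little more than a nearest-neighbour check by Hamming distance:

```python
# Hypothetical sketch of the comparison step behind a hash-lookup API.
# Real PhotoDNA hashes and match thresholds are proprietary; here a
# hash is just a 64-bit integer and the threshold is invented.

MATCH_THRESHOLD = 10  # max differing bits to still count as a match (assumed)

def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two 64-bit hashes."""
    return bin(a ^ b).count("1")

def find_matches(query: int, known_hashes: list[int]) -> list[int]:
    """Known hashes within the distance threshold of the query hash."""
    return [h for h in known_hashes if hamming_distance(query, h) <= MATCH_THRESHOLD]

known = [0xDEADBEEFCAFEBABE, 0x0123456789ABCDEF]
```

The point of using a distance threshold rather than exact equality is that perceptual hashes of near-duplicate media differ by a few bits, so a plain hash table isn't enough; at scale you'd want an index structure rather than this linear scan.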
None of those things are impossible or out of reach... but collecting a new database of hashes is challenging. Where do you get it from? How is it stored? Do you allow the public to access the hash data directly, or do you keep it secret like all the other solutions do?
I'm imagining a solution where servers aggregate all of this data up to a dispatch platform like the one described above, possibly run by a non-profit or NGO, which then dispatches the data to NCMEC directly.
The other thing to keep in mind is that solutions like PhotoDNA are HUGE. I'm talking hundreds of thousands of pieces of reported media per year. It's something that would require a lot of uptime, and the ability to handle a significant number of requests on a daily basis.
I've been thinking: CSAM is just one of the many problems communities face. E.g. YouTube is unable to moderate transphobia properly, which has significant consequences as well.
Let's say we had an ideal federated copy of the existing system. It would still not detect many other types of antisocial behavior. All I'm saying is that the existing approach by M$ feels a bit like it's based on moral tunnel vision, trying to solve complex human social issues with some kind of silver bullet. It lacks nuance. Whereas in fact this is a community management issue.
Honestly I feel it's really a matter of having manageable communities with strong moderation. And the ability to report anonymously, in case one becomes involved in something bad and wants out.
IMO the hardest part is the legal side, and in fact I'm not sure how MS skirted that issue other than through lax US enforcement on corporations. To have a db like this, one must store material that is ordinarily illegal to store: because perceptual hashes are imperfect, and because an algorithm update means re-hashing the originals, I don't think one can get away with storing only the hash of the file. Some kind of computer vision/AI-ish solution might work out, but I wouldn't want to be the person compiling that training set...
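To make the re-hashing point concrete: perceptual hashes are derived from image content, so small edits leave the hash (nearly) unchanged, but changing the hashing algorithm itself invalidates every stored hash, and you'd need the original media to regenerate them. Here's a toy difference-hash over a grayscale pixel grid, purely illustrative and nothing like the real PhotoDNA algorithm:

```python
# Toy "difference hash": 1 bit per horizontal neighbour pair.
# Purely illustrative; real perceptual hashes are far more involved.

def dhash(pixels: list[list[int]]) -> int:
    """Hash a grayscale grid by comparing each pixel to its right neighbour."""
    bits = 0
    for row in pixels:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left < right else 0)
    return bits

grid = [[10, 20, 30], [30, 20, 10]]
# Uniformly brightening the image preserves the bit pattern, so a
# re-encoded or lightly edited copy still matches the stored hash...
brighter = [[p + 5 for p in row] for row in grid]
# ...but if the algorithm changes (say, vertical instead of horizontal
# comparisons), every stored hash is useless without the originals.
```

Brightness shifts keep `dhash(grid) == dhash(brighter)`, which is the "perceptual" property; but nothing survives an algorithm swap, and that's exactly the storage problem above.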
Perhaps the manual reporting tool is enough? Then that content can be forwarded to the central MS service. I wonder if that API can report back to say whether a given file is a positive match.
Can you elaborate on the hash problem?
Personally I was thinking of generating a federated set based on user reporting. Perhaps enhanced by checking with the central service as mentioned above. This db can then be synced with trusted instances.
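One way to sketch that sync (the hash strings, report counts, and the minimum-report threshold below are all invented): each instance keeps per-hash report counts from its own users, merges in counts from trusted peers, and only forwards hashes that cross some threshold of independent reports:

```python
# Hypothetical sketch of merging user-reported hash sets across trusted
# instances. Hashes are opaque hex strings; MIN_REPORTS is an invented
# policy knob, not part of any real service.

MIN_REPORTS = 3  # assumed: require several independent reports before acting

def merge_reports(local: dict[str, int], remote: dict[str, int]) -> dict[str, int]:
    """Combine per-hash report counts from another trusted instance."""
    merged = dict(local)
    for h, count in remote.items():
        merged[h] = merged.get(h, 0) + count
    return merged

def actionable(reports: dict[str, int]) -> set[str]:
    """Hashes with enough independent reports to forward upstream."""
    return {h for h, c in reports.items() if c >= MIN_REPORTS}
```

A threshold like this is one crude way to limit abuse of anonymous reporting (a single malicious instance can't unilaterally flag content), though a real design would also need per-instance trust weighting and audit trails.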