It doesn't, but it very well could. Despite the fact the the neurons are effectively a black box, the output nodes are mapped to functions which are very well defined. You can, after activation of a node but before implementation, simply add a manual filter for particularly negative nodes or remove them entirely. It the equivalent of saying you don't want mario to jump so you filter out any jump button inputs from the neural net.
idk, from what I understand this is the "alignment problem" and it is very nontrivial. Is what you're describing related to the current "reinforcement learning" techniques they use to keep things nice and censored? If so it's been shown to not work very consistently and only functions like suggestions and not actual hard rules, and can backfire.
Not exactly. The alignment problem is an issue with optimization. The network is trained to do somehing, but it may not neccassarily be what you were wanting it to do. The method i referred to doesnt effect the priorities of the network, it simply overides certain outputs that the network may be sending. Imagine a network trained to identify weather conditions and range, and when it has adjusted, it fires a round. Now imagine a second network specifically trained to identify friendly targets. The two networks dont neccassarily need to communicate, and they are completely seperated software wise. It simply that when the second network identifies a target, it prohibits the first network from firing.
By having them not communicate, you solve that it miht fire at friendlies. The issue now becomes that the aiming network thinks its doing its job, and it doesnt understand(because it has no way of knowing) that it cant fire. So it just holds position pointing at the friendly and continuing to attempt to fire.
That does clarify what you mean. Still, that's just composing AI systems that do not operate on discreet logic within systems that do. There is no hard rational guarantee that the aiming model will not misjudge where the bullet will land, or that the target identification model will not misjudge who qualifies as a friendly. It might have passed extensive testing, or been trained on very carefully curated data, but you still have only approximate knowledge of what it will do. That applies even moreso to more open ended or ambiguous tasks.