Usually (*), your CPU addresses pages (chunks of memory that are assigned to a program) in 4KiB steps. So when it does memory management (shuffling memory pages around, deleting them, compressing them, swapping them to disk...), it does so in chunks of 4KiB. Now, let's say you have a GPU that needs to store data in memory and sometimes exchange it with the CPU. But the designers knew that it would almost always use huge textures, so they simplified their design and made it able to access memory only in 2MiB chunks. Now each time the CPU manages a chunk of memory for the GPU, it has to make sure that the chunk always starts on a multiple of 2MiB.
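To make the "lands on a multiple of 2MiB" part concrete, here is a minimal sketch of the usual round-up trick. The names (`PAGE_SIZE`, `GPU_ALIGN`, `align_up`, `is_aligned`) are made up for illustration, and it assumes the alignment is a power of two, which is true for 4KiB and 2MiB:

```python
PAGE_SIZE = 4 * 1024          # 4 KiB CPU page
GPU_ALIGN = 2 * 1024 * 1024   # 2 MiB chunk required by the hypothetical GPU


def align_up(addr: int, alignment: int) -> int:
    """Round addr up to the next multiple of alignment (must be a power of two)."""
    return (addr + alignment - 1) & ~(alignment - 1)


def is_aligned(addr: int, alignment: int) -> bool:
    """True if addr is a multiple of alignment (must be a power of two)."""
    return addr & (alignment - 1) == 0


# An address that is fine for the CPU (page-aligned) but not for the GPU.
addr = 5 * PAGE_SIZE                  # 0x5000
gpu_addr = align_up(addr, GPU_ALIGN)  # next 2 MiB boundary
wasted = gpu_addr - addr              # bytes skipped to reach that boundary
```

In this example the allocation has to move from 0x5000 up to 0x200000, so almost the whole 2MiB in between is skipped just to satisfy the alignment.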
If you take fragmentation into account, this leads to all kinds of funny issues. You can get gaps in your memory, because you need to "skip ahead" to the next 2MiB boundary, or you have a free memory area that is large enough, but is not aligned to 2MiB...
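Both effects show up in a toy first-fit search. This is only a sketch under assumptions: the free list is a plain list of hypothetical `(start, length)` tuples, `find_aligned` is an invented helper, and real allocators are far more involved:

```python
MiB = 1024 * 1024
GPU_ALIGN = 2 * MiB  # alignment required by the hypothetical GPU


def align_up(addr, alignment):
    """Round addr up to the next multiple of alignment (a power of two)."""
    return (addr + alignment - 1) & ~(alignment - 1)


def find_aligned(free_regions, size, alignment):
    """First-fit: return the first aligned start address where size bytes fit, or None."""
    for start, length in free_regions:
        aligned = align_up(start, alignment)
        if aligned + size <= start + length:
            return aligned
    return None


# 2.5 MiB free starting at 1 MiB: big enough, but the only 2 MiB boundary
# inside it is at 2 MiB, and 2 MiB + 2 MiB runs past the end at 3.5 MiB.
too_misaligned = [(1 * MiB, 5 * MiB // 2)]
no_fit = find_aligned(too_misaligned, 2 * MiB, GPU_ALIGN)

# Add a 4 MiB region starting at 5 MiB: the chunk fits at 6 MiB,
# leaving a 1 MiB gap in front of it.
with_second = too_misaligned + [(5 * MiB, 4 * MiB)]
fit = find_aligned(with_second, 2 * MiB, GPU_ALIGN)
```

With only 4KiB alignment the same request would fit right at the start of the first region, which is exactly the difference the alignment requirement makes.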
And it gets even funnier if you have several different devices that have several different alignment requirements. Just one of those neat real-life quirks that can make your nice, clean, theoretical results invalid.
(*): And then there are huge pages, but that is a different can of worms.