This was neccessary in order to make the memory available via GDRcopy
when multiple small allocations were made. cudaMalloc() would return
multiple memory chunks located in the same GPU page, which GDRcopy
pretty much dislikes (`gdrdrv:offset != 0 is not supported`).
As a side effect, this will keep the number of BAR-mappings done
via GDRcopy low, because they seem to be quite limited.
This is still very simple. Only really free memory, when all allocation
have been deallocated so we only need to keep track of the current
number of allocations.
Using CUDA, memory can be allocated on the GPU and shared to peers on
the PCIe bus such as the FPGA. Furthermore, the DMA on the GPU can also
be used to read and write to/from other memory on the PCIe bus, such as
BRAM on the FPGA.
It is probably too costly to do (and verify) it on every read
or write. Furthermore, the user knows better how to make a certain
memory available to the DMA.
This is used for translations that don't use VFIO which used to bridge
the PCIe address space by creating direct mappings from process VA to
the FPGA. When we want to communicate directly via PCIe without the
involvment of the CPU/VFIO, we need the proper translations that are
configured in the FPGA hardware.
This is a workaround until we have a better heuristic (maybe shortest
path?) to choose between multiple paths in the graph. Since the (abstract)
graph has no idea about memory translations, getPath() may even yield
paths that are no valid translation because a pair of inbound/outbound
edges must not neccessarily share a common address window, but from the
perspective of the abstract graph present a valid path.
The callback function is used by the MemoryManager to verify if a path
candidate represents a valid translation.