As discussed in issue #186 and on IM. This function checks every 2048th
cycle whether the thread should be canceled.
This also removes the need for 'kill -9' in the integration test.
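The periodic check can be sketched as follows. This is a minimal sketch, not the code from the commit: the helper names and the use of pthread_testcancel() are assumptions; only the interval of 2048 cycles comes from the text above.

```c
#include <pthread.h>
#include <stdint.h>

/* Interval from the commit message: check on every 2048th cycle. */
#define CANCEL_CHECK_INTERVAL 2048

/* Hypothetical helper: returns 1 if cancellation should be checked on
 * this cycle. 2048 is a power of two, so a bit mask replaces the modulo. */
static int is_cancel_check_cycle(uint64_t cycle)
{
	return (cycle & (CANCEL_CHECK_INTERVAL - 1)) == 0;
}

/* Hypothetical helper: call once per iteration of a busy-polling loop.
 * pthread_testcancel() is a cancellation point; it simply returns if no
 * cancellation request is pending. */
static void maybe_test_cancel(uint64_t cycle)
{
	if (is_cancel_check_cycle(cycle))
		pthread_testcancel();
}
```

Checking only on every 2048th cycle keeps the cost of the cancellation point out of the hot polling path while still letting pthread_cancel() stop the thread.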
Before this commit, the Infiniband node recreated the address handle for
the remote node during every cycle. Now, it creates it only directly
after it has received ah_attr.
* Meta data was not included in the calculation which determines whether
a sample should be sent inline. This caused errors.
* Meta data was not subtracted from sample->length on the receive side.
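The two fixes above amount to the following arithmetic. This is only a sketch under assumptions: the function names, the byte-based parameters, and the conversion from bytes to a number of values are hypothetical and not taken from the node's source.

```c
#include <stddef.h>

/* Hypothetical helper: a sample may be sent inline only if payload AND
 * meta data together fit into the maximum inline size. */
static int fits_inline(size_t payload_bytes, size_t meta_bytes, size_t max_inline)
{
	return payload_bytes + meta_bytes <= max_inline;
}

/* Hypothetical helper: on the receive side, the meta data bytes must be
 * subtracted before the number of received values is derived. */
static size_t received_values(size_t received_bytes, size_t meta_bytes, size_t value_size)
{
	if (value_size == 0 || received_bytes < meta_bytes)
		return 0;

	return (received_bytes - meta_bytes) / value_size;
}
```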
ib_write() and ib_read() now point to the sequence, ts_origin, and format
members of struct sample in a separate scatter/gather element each.
ib_read() measures the time with time_now() (from villas/timing.h) and
sets all flags at receive side.
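A sketch of the scatter/gather layout described above. The struct sge below is a stand-in that mirrors the fields of struct ibv_sge (addr, length, lkey) so the example compiles without the verbs headers. The reduced struct sample, the function name, and the order of the elements are assumptions.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Stand-in with the same fields as struct ibv_sge. */
struct sge {
	uint64_t addr;
	uint32_t length;
	uint32_t lkey;
};

/* Reduced, hypothetical version of struct sample. */
struct sample {
	uint64_t sequence;
	struct timespec ts_origin;
	uint64_t format;
	unsigned length; /* number of values in data[] */
	double data[8];
};

/* Hypothetical helper: one element for the payload plus one element for
 * each meta data member. Returns the number of elements filled. */
static int fill_sges(struct sample *s, uint32_t lkey, struct sge sge[4])
{
	sge[0] = (struct sge) { (uint64_t) (uintptr_t) s->data,
	                        (uint32_t) (s->length * sizeof(double)), lkey };
	sge[1] = (struct sge) { (uint64_t) (uintptr_t) &s->sequence,
	                        (uint32_t) sizeof(s->sequence), lkey };
	sge[2] = (struct sge) { (uint64_t) (uintptr_t) &s->ts_origin,
	                        (uint32_t) sizeof(s->ts_origin), lkey };
	sge[3] = (struct sge) { (uint64_t) (uintptr_t) &s->format,
	                        (uint32_t) sizeof(s->format), lkey };

	return 4;
}
```

Pointing each element directly at the struct member avoids copying the meta data into a separate contiguous buffer before posting the work request.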
Closes #152. As described in #182, we will not rearrange the Queue Pairs
for connected mode. As soon as we test many-to-one connections for the
unreliable connection, we will look again at this issue.
Prior to this commit, we called rdma_disconnect() and waited for a fixed
amount of time. This waiting time was arbitrary. Now, we keep polling
the receive Completion Queue until ib->conn.available_recv_wrs is zero
and all receive samples are thus given back to the framework.
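The drain loop can be sketched like this. The struct conn below only models the counter named above. The poll callback and fake_poll() are hypothetical stand-ins; a real node would presumably wrap ibv_poll_cq() and return each completed sample with sample_put() there.

```c
/* Reduced stand-in for the connection state. */
struct conn {
	int available_recv_wrs;
};

/* Hypothetical poll callback: reaps at most max completions and returns
 * how many it reaped, or a negative value on error. */
typedef int (*poll_fn)(void *ctx, int max);

/* Keep polling until every posted receive Work Request has completed.
 * Returns the number of completions reaped, or -1 on error. */
static int drain_recv_wrs(struct conn *c, poll_fn poll, void *ctx, int batch)
{
	int total = 0;

	while (c->available_recv_wrs > 0) {
		int want = batch < c->available_recv_wrs ? batch : c->available_recv_wrs;
		int n = poll(ctx, want);
		if (n < 0)
			return -1;

		c->available_recv_wrs -= n;
		total += n;
	}

	return total;
}

/* Fake Completion Queue for demonstration: ctx points to the number of
 * flushed completions that are still pending. */
static int fake_poll(void *ctx, int max)
{
	int *pending = ctx;
	int n = *pending < max ? *pending : max;

	*pending -= n;
	return n;
}
```

Unlike a fixed sleep, the loop terminates exactly when the last flushed Work Request has been reaped, so no sample remains with the node.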
The QP type depends on the port space of the RDMA CM ID. If the
RDMA CM ID is set to TCP, the QP has to be set to RC. If it is set to
UDP, it has to be set to UD.
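The mapping can be written down as a small lookup. The enums below are stand-ins so that the sketch compiles without the RDMA headers; in real code they would correspond to RDMA_PS_TCP / RDMA_PS_UDP and IBV_QPT_RC / IBV_QPT_UD. The function name is hypothetical.

```c
/* Stand-ins for the port space and QP type constants. */
enum sketch_port_space { SKETCH_PS_TCP, SKETCH_PS_UDP, SKETCH_PS_OTHER };
enum sketch_qp_type { SKETCH_QPT_RC, SKETCH_QPT_UD };

/* Hypothetical helper: derive the QP type from the port space.
 * Returns 0 on success and -1 for a port space that is not handled. */
static int qp_type_for_port_space(enum sketch_port_space ps, enum sketch_qp_type *out)
{
	switch (ps) {
	case SKETCH_PS_TCP:
		*out = SKETCH_QPT_RC; /* reliable connection */
		return 0;

	case SKETCH_PS_UDP:
		*out = SKETCH_QPT_UD; /* unreliable datagram */
		return 0;

	default:
		return -1;
	}
}
```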
Node is now able to send data in RDMA_PS_UDP mode. Right now it creates
a new rdma_cm_id for every connection request. We could/should do this
differently.
The node reserves a certain number of samples for use in its queues.
Before this commit, the only moment to release them to the framework was
during ib_read()/ib_write().
However, there were a couple of problems. In the following, I take
ib_read() as an example; ib_write() is analogous.
The first problem was:
1. If a QP disconnects, all Work Requests get invalidated and will be
"flushed" to a Completion Queue.
A possible solution would be to save them in an intermediate buffer.
We could then "exchange" these samples with the framework as soon as the node
connects again and ib_read() is called again. So, we would get valid
samples from the framework, post them, and give the "invalidated" samples back.
But, there is a second problem:
2. We cannot assume that ib_read() is ever called again after
ib_disconnect(). This is for example the case if the disconnect is
triggered by ib_stop() and not by an external node that disconnects.
This would result in a memory leak, since the samples would never be
returned to the framework, although the node is stopped.
Because of this second problem, I decided to return all samples with
sample_put() in the disconnect function. An additional benefit is that
this is more convenient than another buffer to temporarily save the
invalidated samples.
Before, the node would throw an error as soon as it could not connect to
the remote host. Now, it will throw a warning and switch to listening
mode (in which it will wait for another node to connect).
In the case that a node was already disconnected but not stopped,
rdma_cm_get_event() always blocked and we couldn't join the threads. This
is solved in this commit by registering SIGUSR1 for the CM event thread.
This bug originated in issue #152.
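The general technique of waking a thread out of a blocking call with a signal can be sketched as follows. This is not the node's code: a read() on a pipe stands in for the blocking call on the CM event channel, and all names are hypothetical. The key point is that the handler is installed without SA_RESTART, so the blocking call fails with EINTR instead of being restarted.

```c
#define _POSIX_C_SOURCE 200809L

#include <errno.h>
#include <pthread.h>
#include <signal.h>
#include <stdatomic.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

/* The handler does nothing; it only exists so that the signal interrupts
 * the blocking call instead of terminating the process. */
static void on_sigusr1(int sig)
{
	(void) sig;
}

/* Install the handler without SA_RESTART. */
static int install_wakeup_handler(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_sigusr1;
	sigemptyset(&sa.sa_mask);
	sa.sa_flags = 0;

	return sigaction(SIGUSR1, &sa, NULL);
}

struct blocker {
	int fd;                 /* stand-in for the event channel */
	atomic_int interrupted; /* set if read() failed with EINTR */
	atomic_int done;        /* set right before the thread returns */
};

static void *blocking_thread(void *arg)
{
	struct blocker *b = arg;
	char c;

	if (read(b->fd, &c, 1) < 0 && errno == EINTR)
		atomic_store(&b->interrupted, 1);

	atomic_store(&b->done, 1);
	return NULL;
}

/* Signal the thread until it has left the blocking call, then join it.
 * Repeating the signal covers the case in which the first one arrives
 * before the thread has entered read(). */
static int stop_blocking_thread(pthread_t t, struct blocker *b)
{
	const struct timespec delay = { 0, 1000000 }; /* 1 ms */

	while (!atomic_load(&b->done)) {
		pthread_kill(t, SIGUSR1);
		nanosleep(&delay, NULL);
	}

	return pthread_join(t, NULL);
}
```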
The actual maximum size for inline mode is now returned to the user and
there is a check that inline_mode is either 0 or 1. Furthermore, this
commit includes a minor improvement in ib_write().
The user can set the maximum size of the inline data and the node checks
if a sample can be sent inline. This commit doesn't contain an info
message to the user about what the final max inline size will be. (The
HCA will probably change the value set by the user.)
Now, ib_write() reads cnt values from the Completion Queue every cycle.
If it is not able to return them to the framework immediately, it
temporarily saves them on a stack.
ib_write() checks every cycle if the stack is non-empty and if it is
possible to return values from the stack to the framework.
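A minimal sketch of such a stack of deferred samples. The fixed capacity, the reduced struct sample, and the function names are assumptions made for illustration.

```c
#include <stddef.h>

#define STACK_MAX 64 /* hypothetical capacity */

/* Reduced stand-in for struct sample. */
struct sample {
	int id;
};

struct sample_stack {
	struct sample *items[STACK_MAX];
	size_t top; /* number of samples currently on the stack */
};

/* Park a sample that cannot be returned right now.
 * Returns 0 on success and -1 if the stack is full. */
static int stack_push(struct sample_stack *s, struct sample *smp)
{
	if (s->top >= STACK_MAX)
		return -1;

	s->items[s->top++] = smp;
	return 0;
}

/* Move up to room samples from the stack into out[], newest first.
 * Returns the number of samples moved. */
static size_t stack_drain(struct sample_stack *s, struct sample *out[], size_t room)
{
	size_t n = 0;

	while (n < room && s->top > 0)
		out[n++] = s->items[--s->top];

	return n;
}
```

In each cycle, the write path would first call stack_drain() with the number of slots the framework can currently take back and then push any completions that still do not fit.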
The functions now look like this:
int node_read(struct node *n, struct sample *smps[], unsigned cnt, unsigned *release);
int node_write(struct node *n, struct sample *smps[], unsigned cnt, unsigned *release);
This commit enables nodes to control how many samples will
be released by the framework through *release.
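How a node might use *release can be sketched with a toy read function that has the same parameter list as node_read(). Everything else is an assumption: the reduced structs, the hold field, the reference counter, the convention that the caller presets *release to cnt, and the convention that the first *release entries of smps[] are the ones released.

```c
/* Reduced stand-ins for the framework types. */
struct sample {
	int refcnt;
};

struct node {
	unsigned hold; /* hypothetical: samples the node keeps per call */
};

/* Toy read function: the node keeps up to n->hold samples for its own
 * queues and tells the caller to release only the remaining ones. */
static int example_node_read(struct node *n, struct sample *smps[], unsigned cnt, unsigned *release)
{
	unsigned keep = n->hold < cnt ? n->hold : cnt;

	(void) smps; /* a real node would fill or exchange the samples here */

	*release = cnt - keep;
	return (int) cnt;
}

/* Hypothetical framework side: drop one reference on each of the first
 * release samples. */
static void framework_release(struct sample *smps[], unsigned release)
{
	for (unsigned i = 0; i < release; i++)
		smps[i]->refcnt--;
}
```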