Batch preparation¶
This section presents how the training data is prepared and organized in batches before the loss is calculated and optimized on each of them.
Training proceeds by iterating over the edges, through various nested loops. The
outermost one walks through so-called epochs. Each epoch is independent and
essentially equivalent to every other one. Their goal is to repeat the inner loop
until convergence. Each epoch visits all the edges exactly once. The number of
epochs is specified in the num_epochs
configuration parameter.
The edges are partitioned into edge sets (one for each directory of the edge_paths
configuration key) and, within each epoch, the edge sets are traversed in order.
When iterating over an edge set, each of its buckets is first divided into
equally sized chunks: each chunk spans a contiguous interval of edges (in the
order they are stored in the files) and the number of chunks can be tweaked
using the num_edge_chunks
configuration key. The training first operates
on the all the first chunks of all buckets, then on all of their second chunks,
and so on.
Next, the algorithm iterates over the buckets. The order in which buckets are
processed depends on the value of the bucket_order
configuration key. In
addition to a random permutation, there are methods that try to have successive
buckets share a common partition: this allows for that partition to be reused,
thus allowing it to be kept in memory rather than being unloaded and another one
getting loaded in its place. (In distributed mode,
the various trainer processes operate on the buckets at the same time, thus the
iteration is managed differently).
Once the trainer has fixed a given chunk and a certain bucket, its edges are
finally loaded from disk. When
evaluating during training, a subset of these
edges is withheld (such subset is the same for all epochs). The remaining edges
are immediately uniformly shuffled and then split into equal parts. These parts
are distributed among a pool of processes, so that the training can proceed
in parallel on all of them at the same time. These subprocesses are “Hogwild!”
workers, which do not synchronize their computations or memory accesses. The
number of such workers is determined by the workers
parameter.
The way each worker trains on its set of edges depends on whether
dynamic relations are in use. The simplest scenario is if
they are, in which case the edges are split into contiguous batches (each one having
the size specified in the batch_size
configuration key, except possibly the last
one which could be smaller). Training is then performed on that batch before moving
on to the next one.
When dynamic relations are not in use, however, the loss can only be computed on
a set of edges that are all of the same type. Thus the worker first randomly
samples a relation type, with probability proportional to the number of edges
of that type that are left in the pool. It then takes the first batch_size
relations of
that type (or fewer, if not enough of them are left), removes them from the pool and
performs training on them.