I/O format¶
Entity and relation types¶
The list of entity types (each identified by a string), plus some information
about each of them, is given in the entities dictionary in the configuration file.
The list of relation types (each identified by its index in that list), plus
some data like what their left- and right-hand side entity types are, is in the
relations key of the configuration file.
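As a rough illustration, the two keys could look like the fragment below. Only the entities and relations keys come from the text above; the entity type names, the relation names and the per-entry fields (num_partitions, name, lhs, rhs) are assumptions made for this sketch.

```python
# Hypothetical configuration fragment. The type names ("user", "item") and
# the per-entry fields are illustrative assumptions, not a complete config.
config = {
    "entities": {
        "user": {"num_partitions": 4},
        "item": {"num_partitions": 1},
    },
    "relations": [
        {"name": "bought", "lhs": "user", "rhs": "item"},
        {"name": "follows", "lhs": "user", "rhs": "user"},
    ],
}

# Entity types are identified by their string key, relation types by their
# position in the list.
for idx, relation in enumerate(config["relations"]):
    print(idx, relation["name"], relation["lhs"], "->", relation["rhs"])
```

The position of a relation in this list is the value that identifies it elsewhere, so reordering the list changes the meaning of any data that refers to relation indices.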
Entities¶
The only information that needs to be provided about entities is how many there
are in each entity type’s partition. This is done by putting a file named
entity_count_type_part.txt, for each entity type identified by type and each
partition part, in the directory specified by the entity_path key of the
configuration file. Each of these files must contain a single integer (as text),
which is the number of entities in that partition.
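The following is a minimal sketch of how such files could be produced. The helper name write_entity_counts, its input layout and the sample counts are hypothetical, and the sketch assumes partitions are numbered from 0.

```python
import tempfile
from pathlib import Path


def write_entity_counts(entity_path, counts):
    """Write one entity_count_<type>_<part>.txt file per (type, partition).

    `counts` maps an entity type name to a list whose i-th element is the
    number of entities in partition i (a hypothetical input layout).
    """
    root = Path(entity_path)
    root.mkdir(parents=True, exist_ok=True)
    for entity_type, per_partition in counts.items():
        for part, count in enumerate(per_partition):
            path = root / f"entity_count_{entity_type}_{part}.txt"
            # The file holds a single integer, written as text.
            path.write_text(str(count))


with tempfile.TemporaryDirectory() as tmp:
    # The temporary directory stands in for the entity_path directory.
    write_entity_counts(tmp, {"user": [120, 118], "item": [5000]})
    written = sorted(p.name for p in Path(tmp).iterdir())
    print(written)
```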
It is possible to provide an initial value for the embeddings by specifying a
value for the init_path configuration key, which is the name of a directory that
contains files in a format similar to the output format detailed in
Checkpoint (possibly without the optimizer state dicts).
If no initial value is provided, it will be auto-generated, with each dimension
sampled from the centered normal distribution whose standard deviation can be
configured using the init_scale configuration key. For performance reasons,
the samples of all the entities of a certain type will not be independent.
Edges¶
For each bucket there must be a file that stores all the edges that fall in that
bucket, of all relation types. This means that such a file is identified only by
two integers, the partitions of its left- and right-hand side entities. It must
be named edges_lhs_rhs.h5 (where lhs and rhs are the above integers), and it
must be an HDF5 file containing three one-dimensional datasets of the same
length, called rel, lhs and rhs. The elements in the \(i\)-th position of each
of them define the \(i\)-th edge: rel identifies the relation type (and thus the
left- and right-hand side entity types), while lhs and rhs give the indices
of the left- and right-hand side entities within their respective partitions.
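To make the bucket layout concrete, here is a small sketch that groups edges by the partitions of their endpoints and builds the three columns each file would hold. The input tuple layout and the function name group_edges_by_bucket are assumptions chosen for illustration.

```python
from collections import defaultdict


def group_edges_by_bucket(edges):
    """Group edges by (lhs partition, rhs partition).

    Each edge is a hypothetical tuple
    (rel, lhs_part, lhs_offset, rhs_part, rhs_offset), where the offsets are
    the indices of the entities within their respective partitions.
    """
    buckets = defaultdict(lambda: {"rel": [], "lhs": [], "rhs": []})
    for rel, lhs_part, lhs_offset, rhs_part, rhs_offset in edges:
        bucket = buckets[(lhs_part, rhs_part)]
        # The i-th elements of the three columns together define one edge.
        bucket["rel"].append(rel)
        bucket["lhs"].append(lhs_offset)
        bucket["rhs"].append(rhs_offset)
    return dict(buckets)


edges = [
    (0, 0, 3, 1, 2),
    (1, 0, 7, 1, 5),
    (0, 1, 4, 0, 9),
]
buckets = group_edges_by_bucket(edges)
for (lhs_part, rhs_part), columns in sorted(buckets.items()):
    print(f"edges_{lhs_part}_{rhs_part}.h5", columns)
```

Edges of different relation types end up in the same bucket as long as their endpoints fall in the same pair of partitions.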
To ease future updates to this format, each file must contain the format version
in the format_version attribute of the top-level group. The current version is 1.
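A file with this layout could be written with the h5py library roughly as follows. This is a sketch, not the project's own writer: the helper name write_edge_file and the 64-bit integer dtype are assumptions, and the sample edges are made up.

```python
import os
import tempfile

import h5py
import numpy as np


def write_edge_file(directory, lhs_part, rhs_part, rel, lhs, rhs):
    """Write the edges of one bucket to edges_<lhs_part>_<rhs_part>.h5."""
    # int64 is an assumption; the text only requires one-dimensional datasets.
    rel, lhs, rhs = (np.asarray(a, dtype=np.int64) for a in (rel, lhs, rhs))
    if not (rel.ndim == lhs.ndim == rhs.ndim == 1):
        raise ValueError("rel, lhs and rhs must be one-dimensional")
    if not (len(rel) == len(lhs) == len(rhs)):
        raise ValueError("rel, lhs and rhs must have the same length")
    path = os.path.join(directory, f"edges_{lhs_part}_{rhs_part}.h5")
    with h5py.File(path, "w") as f:
        # Attributes set on the File object belong to the top-level group.
        f.attrs["format_version"] = 1
        f.create_dataset("rel", data=rel)
        f.create_dataset("lhs", data=lhs)
        f.create_dataset("rhs", data=rhs)
    return path


with tempfile.TemporaryDirectory() as tmp:
    path = write_edge_file(tmp, 0, 1, rel=[0, 0, 1], lhs=[3, 7, 7], rhs=[2, 2, 5])
    # Read the file back to check the layout.
    with h5py.File(path, "r") as f:
        version = int(f.attrs["format_version"])
        third_edge = (int(f["rel"][2]), int(f["lhs"][2]), int(f["rhs"][2]))
```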
If an entity type is unpartitioned (that is, all its entities belong to the same partition), then the edges incident to these entities must still be uniformly spread across all buckets.
These files, for all buckets, must be stored in the same directory, whose path
must be given in the edge_paths configuration key. That key can actually contain
a list of paths, each pointing to a directory in the format described above; in
that case the graph will contain the union of all their edges.
Checkpoint¶
The training’s checkpoints are also its output, and they are written to the
directory given as the checkpoint_path parameter in the configuration.
Checkpoints are identified by successive positive integers, starting from 1, and
all the files belonging to a certain checkpoint have an extra component
.vversion (where version is that integer) between their name and extension
(e.g., something.v42.h5 for version 42).
The latest complete checkpoint version is stored in an additional file in the
same directory, called checkpoint_version.txt, which contains a single integer,
the current version.
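The naming scheme can be sketched with two small helpers. Both function names are hypothetical, and the temporary directory only stands in for a real checkpoint directory.

```python
import tempfile
from pathlib import Path


def latest_checkpoint_version(checkpoint_path):
    """Read the latest complete version from checkpoint_version.txt."""
    text = (Path(checkpoint_path) / "checkpoint_version.txt").read_text()
    return int(text.strip())


def versioned_name(stem, extension, version):
    """Insert the .v<version> component between a file's name and extension."""
    return f"{stem}.v{version}.{extension}"


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for the version file that training would have written.
    (Path(tmp) / "checkpoint_version.txt").write_text("42")
    version = latest_checkpoint_version(tmp)
    checkpoint_file = versioned_name("something", "h5", version)
    print(checkpoint_file)
```

Reading the version file first, rather than listing the directory, avoids picking up files from a checkpoint that is still being written.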
Each checkpoint contains a JSON dump of the config that was used to produce it,
stored in the config.json file.
After a new checkpoint version is saved, the previous one will automatically be
deleted. In order to periodically preserve some of these versions, set the
checkpoint_preservation_interval config flag to the desired period (expressed
in number of epochs).
Model parameters¶
The model parameters are stored in a file named model.h5, which is an HDF5 file
containing one dataset for each parameter, all of which are located within the
model group. Currently, the parameters that are provided are:
- model/relations/idx/operator/side/param, with the parameters of each
  relation’s operator.
- model/entities/type/global_embedding, with the per-entity type global
  embedding.
Each of these datasets also contains, in the state_dict_key attribute, the key
under which it was stored in the model’s state dict. An additional dataset may
exist, optimizer/state_dict, which contains the binary blob (obtained through
torch.save()) of the state dict of the model’s optimizer.
Finally, the top-level group of the file contains a few attributes with additional metadata. This mainly includes the format version, a JSON-dump of the config and some information about the iteration that produced the checkpoint.
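One way to inspect such a file is to walk the model group with h5py and collect each dataset together with its state_dict_key attribute. The sketch below first builds a tiny stand-in file so that it runs on its own; the function name list_model_parameters, the entity type "user" and the key "hypothetical.key" are made up for illustration.

```python
import os
import tempfile

import h5py
import numpy as np


def list_model_parameters(path):
    """Return {dataset path: (shape, state_dict_key)} for datasets in model/."""
    found = {}

    def visit(name, obj):
        # `name` is relative to the group being visited.
        if isinstance(obj, h5py.Dataset):
            key = obj.attrs.get("state_dict_key")
            if isinstance(key, bytes):
                key = key.decode("utf-8")
            found[f"model/{name}"] = (obj.shape, key)

    with h5py.File(path, "r") as f:
        f["model"].visititems(visit)
    return found


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.v1.h5")
    # Stand-in file with one made-up parameter; a real file comes from training.
    with h5py.File(path, "w") as f:
        ds = f.create_dataset("model/entities/user/global_embedding", data=np.zeros(4))
        ds.attrs["state_dict_key"] = "hypothetical.key"
    params = list_model_parameters(path)
    print(params)
```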
Embeddings¶
For each entity type and each of its partitions, there is a file
embeddings_type_part.h5 (where type is the type’s name and part is the 0-based
index of the partition), which is an HDF5 file with two datasets.
One of them, a two-dimensional dataset called embeddings, contains the
embeddings of the entities, with the first dimension being the number of
entities and the second being the dimension of the embedding.
Just like the model parameters file, it also includes the optimizer state dict
and additional metadata.
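Reading the embeddings of one partition back could look like the sketch below, which uses h5py and assumes that the checkpoint version component described under Checkpoint is part of the file name. The helper name load_embeddings, the entity type "user" and the sample values are hypothetical, and the stand-in file contains only the embeddings dataset.

```python
import os
import tempfile

import h5py
import numpy as np


def load_embeddings(checkpoint_path, entity_type, part, version):
    """Load the embeddings dataset of one partition as a NumPy array."""
    name = f"embeddings_{entity_type}_{part}.v{version}.h5"
    with h5py.File(os.path.join(checkpoint_path, name), "r") as f:
        # Shape is (number of entities, embedding dimension).
        return f["embeddings"][...]


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for a file produced by training: 3 entities, dimension 2.
    sample = np.arange(6, dtype=np.float32).reshape(3, 2)
    with h5py.File(os.path.join(tmp, "embeddings_user_0.v1.h5"), "w") as f:
        f.create_dataset("embeddings", data=sample)
    embeddings = load_embeddings(tmp, "user", 0, version=1)
    print(embeddings.shape)
```

Row i of the returned array is the embedding of the entity with index i within that partition.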