I/O format

Entity and relation types

The list of entity types (each identified by a string), plus some information about each of them, is given in the entities dictionary in the configuration file. The list of relation types (each identified by its index in that list), plus some data like what their left- and right-hand side entity types are, is in the relations key of the configuration file.

Entities

The only information that needs to be provided about entities is how many there are in each entity type’s partition. This is done by putting a file named entity_count_type_part.txt for each entity type identified by type and each partition part in the directory specified by the entity_path config parameter. These files must contain a single integer (as text), which is the number of entities in that partition. The directory where all these files reside must be specified as the entity_path key of the configuration file.

It is possible to provide an initial value for the embeddings, by specifying a value for the init_path configuration key, which is the name of a directory that contains files in a format similar to the output format detailed in Checkpoint (possibly without the optimizer state dicts).

If no initial value is provided, it will be auto-generated, with each dimension sampled from the centered normal distribution whose standard deviation can be configured using the init_scale configuration key. For performance reasons the samples of all the entities of a certain type will not be independent.

Edges

For each bucket there must be a file that stores all the edges that fall in that bucket, of all relation types. This means that such a file is only identified by two integers, the partitions of its left- and right-hand side entities. It must be named edges_lhs_rhs.h5 (where lhs and rhs are the above integers), it must be a HDF5 file containing three one-dimensional datasets of the same length, called rel, lhs and rhs. The elements in the \(i\)-th positions in each of them define the \(i\)-th edge: rel identifies the relation type (and thus the left- and right-hand side entity types), lhs and rhs given the indices of the left- and right-hand side entities within their respective partitions.

To ease future updates to this format, each file must contain the format version in the format_version attribute of the top-level group. The current version is 1.

If an entity type is unpartitioned (that is, all its entities belong to the same partition), then the edges incident to these entities must still be uniformly spread across all buckets.

These files, for all buckets, must be stored in the same directory, which must be passed as the edge_paths configuration key. That key can actually contain a list of paths, each pointing to a directory of the format described above: in that case the graph will contain the union of all their edges.

Checkpoint

The training’s checkpoints are also its output, and they are written to the directory given as the checkpoint_path parameter in the configuration. Checkpoints are identified by successive positive integers, starting from 1, and all the files belonging to a certain checkpoint have an extra component .vversion between their name and extension (e.g., something.v42.h5 for version 42).

The latest complete checkpoint version is stored in an additional file in the same directory, called checkpoint_version.txt, which contains a single integer number, the current version.

Each checkpoint contains a JSON dump of the config that was used to produce it stored in the config.json file.

After a new checkpoint version is saved, the previous one will automatically be deleted. In order to periodically preserve some of these versions, set the checkpoint_preservation_interval config flag to the desired period (expressed in number of epochs).

Model parameters

The model parameters are stored in a file named model.h5, which is a HDF5 file containing one dataset for each parameter, all of which are located within the model group. Currently, the parameters that are provided are:

  • model/relations/idx/operator/side/param with the parameters of each relation’s operator.
  • model/entities/type/global_embedding with the per-entity type global embedding.

Each of these datasets also contains, in the state_dict_key attribute, the key it was stored inside the model state dict. An additional dataset may exist, optimizer/state_dict, which contains the binary blob (obtained through torch.save()) of the state dict of the model’s optimizer.

Finally, the top-level group of the file contains a few attributes with additional metadata. This mainly includes the format version, a JSON-dump of the config and some information about the iteration that produced the checkpoint.

Embeddings

Then, for each entity type and each of its partitions, there is a file embeddings_type_part.h5 (where type is the type’s name and part is the 0-based index of the partition), which is a HDF5 file with two datasets. One two-dimensional dataset, called embeddings, contains the embeddings of the entities, with the first dimension being the number of entities and the second being the dimension of the embedding.

Just like for the model parameters file, the optimizer state dict and additional metadata is also included.