.. _io-format:

I/O format
==========

Entity and relation types
-------------------------

The list of entity types (each identified by a string), plus some information
about each of them, is given in the ``entities`` dictionary in the configuration
file. The list of relation types (each identified by its index in that list),
plus some data like what their left- and right-hand side entity types are, is in
the ``relations`` key of the configuration file.

Entities
--------

The only information that needs to be provided about entities is how many there
are in each entity type's partition. This is done by putting a file named
:file:`entity_count_{type}_{part}.txt` for each entity type identified by
``type`` and each partition ``part`` in the directory specified by the
``entity_path`` config parameter. These files must contain a single integer (as
text), which is the number of entities in that partition. The directory where
all these files reside must be specified as the ``entity_path`` key of the
configuration file.

It is possible to provide an initial value for the embeddings, by specifying a
value for the ``init_path`` configuration key, which is the name of a directory
that contains files in a format similar to the output format detailed in
:ref:`output-format` (possibly without the optimizer state dicts).

If no initial value is provided, it will be auto-generated, with each dimension
sampled from the centered normal distribution whose standard deviation can be
configured using the ``init_scale`` configuration key. For performance reasons
the samples of all the entities of a certain type will not be independent.

Edges
-----

For each bucket there must be a file that stores all the edges that fall in that
bucket, of all relation types. This means that such a file is only identified by
two integers, the partitions of its left- and right-hand side entities.
It must be named :file:`edges_{lhs}_{rhs}.h5` (where ``lhs`` and ``rhs`` are the
above integers) and must be an HDF5 file containing three one-dimensional
datasets of the same length, called ``rel``, ``lhs`` and ``rhs``. The elements
in the :math:`i`-th positions in each of them define the :math:`i`-th edge:
``rel`` identifies the relation type (and thus the left- and right-hand side
entity types), while ``lhs`` and ``rhs`` give the indices of the left- and
right-hand side entities within their respective partitions.

To ease future updates to this format, each file must contain the format version
in the ``format_version`` attribute of the top-level group. The current version
is 1.

If an entity type is unpartitioned (that is, all its entities belong to the same
partition), then the edges incident to these entities must still be uniformly
spread across all buckets.

These files, for all buckets, must be stored in the same directory, which must
be passed as the ``edge_paths`` configuration key. That key can actually contain
a list of paths, each pointing to a directory of the format described above: in
that case the graph will contain the union of all their edges.

.. _output-format:

Checkpoint
----------

The training's checkpoints are also its output, and they are written to the
directory given as the ``checkpoint_path`` parameter in the configuration.

Checkpoints are identified by successive positive integers, starting from 1, and
all the files belonging to a certain checkpoint have an extra component
:file:`.v{version}` between their name and extension (e.g.,
:file:`{something}.v42.h5` for version 42).

The latest complete checkpoint version is stored in an additional file in the
same directory, called :file:`checkpoint_version.txt`, which contains a single
integer number, the current version.

Each checkpoint contains a JSON dump of the config that was used to produce it,
stored in the :file:`config.json` file.
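
As an illustration of the naming scheme above, the following minimal sketch
reads the latest version from :file:`checkpoint_version.txt` and builds a
versioned file name. The helper names ``latest_checkpoint_version`` and
``versioned_name`` are hypothetical and not part of any library; only the file
layout is taken from the description above.

```python
from pathlib import Path


def latest_checkpoint_version(checkpoint_path):
    """Return the integer stored in checkpoint_version.txt (hypothetical helper)."""
    text = (Path(checkpoint_path) / "checkpoint_version.txt").read_text()
    return int(text.strip())


def versioned_name(name, version):
    """Insert the .v{version} component between a file's name and its extension."""
    path = Path(name)
    return f"{path.stem}.v{version}{path.suffix}"
```

For instance, ``versioned_name("something.h5", 42)`` gives
``"something.v42.h5"``, matching the example above.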
After a new checkpoint version is saved, the previous one will automatically be
deleted. In order to periodically preserve some of these versions, set the
``checkpoint_preservation_interval`` config flag to the desired period
(expressed in number of epochs).

Model parameters
^^^^^^^^^^^^^^^^

The model parameters are stored in a file named :file:`model.h5`, which is an
HDF5 file containing one dataset for each parameter, all of which are located
within the ``model`` group. Currently, the parameters that are provided are:

- :samp:`model/relations/{idx}/operator/{side}/{param}` with the parameters of
  each relation's operator.
- :samp:`model/entities/{type}/global_embedding` with the per-entity type global
  embedding.

Each of these datasets also contains, in the ``state_dict_key`` attribute, the
key under which it was stored in the model state dict. An additional dataset may
exist, ``optimizer/state_dict``, which contains the binary blob (obtained
through :func:`torch.save`) of the state dict of the model's optimizer.

Finally, the top-level group of the file contains a few attributes with
additional metadata. This mainly includes the format version, a JSON dump of the
config and some information about the iteration that produced the checkpoint.

Embeddings
^^^^^^^^^^

Then, for each entity type and each of its partitions, there is a file
:file:`embeddings_{type}_{part}.h5` (where ``type`` is the type's name and
``part`` is the 0-based index of the partition), which is an HDF5 file with two
datasets. One two-dimensional dataset, called ``embeddings``, contains the
embeddings of the entities, with the first dimension being the number of
entities and the second being the dimension of the embedding.

Just like for the model parameters file, the optimizer state dict and additional
metadata are also included.
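
To make the layout of an embeddings file concrete, here is a minimal sketch that
loads one partition with the third-party ``h5py`` package. The function name
``load_embeddings`` and its arguments are hypothetical; the sketch assumes that
the file carries the :file:`.v{version}` component described in the Checkpoint
section, and it returns the top-level attributes as-is without assuming their
names.

```python
from pathlib import Path

import h5py  # third-party package, assumed to be installed


def load_embeddings(checkpoint_path, entity_type, part, version):
    """Load one partition's embeddings plus the file's top-level attributes."""
    # Assumed name: embeddings_{type}_{part}.h5 with the .v{version} component.
    name = f"embeddings_{entity_type}_{part}.v{version}.h5"
    with h5py.File(Path(checkpoint_path) / name, "r") as hf:
        # Two-dimensional array: (number of entities, embedding dimension).
        embeddings = hf["embeddings"][...]
        # Metadata stored as attributes of the top-level group.
        metadata = dict(hf.attrs)
    return embeddings, metadata
```

The returned array has one row per entity of that partition, so row ``i``
corresponds to the entity with index ``i`` within the partition.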