Configuration¶

All of PBG's command-line binaries take a positional parameter that is the path to a configuration file. This must be a Python file (ending in .py) that implements a function called get_torchbiggraph_config, which takes no parameters and returns a JSON-like data structure (i.e., nested lists or dicts with string keys, whose leaf values are None, booleans, integers, floats or strings). The return value is parsed according to the configuration schema. The schema can be found by running the binary with the --help option, and is reproduced in the Schema section below for completeness.

For some example configs, see the files inside the torchbiggraph/examples/configs directory.

The get_torchbiggraph_config function takes no arguments, but may execute arbitrary Python code. Therefore, a config file may dynamically construct a config based on environment variables or other file or system information.
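
As an illustration, a minimal config file might look like the sketch below. The key names come from the schema on this page; the PBG_DATA_DIR environment variable, the directory layout, the entity and relation names, and all the values are hypothetical and only show how a config can be built dynamically.

```python
import os


def get_torchbiggraph_config():
    # Hypothetical environment variable, used only to show that the
    # config can depend on the environment it is executed in.
    data_dir = os.environ.get("PBG_DATA_DIR", "data/example")

    config = dict(
        # I/O data
        entity_path=os.path.join(data_dir, "entities"),
        edge_paths=[os.path.join(data_dir, "edges")],
        checkpoint_path=os.path.join(data_dir, "model"),
        # Graph structure: one unpartitioned entity type, one relation type
        entities={"all": {"num_partitions": 1}},
        relations=[
            {"name": "all_edges", "lhs": "all", "rhs": "all", "operator": "none"}
        ],
        # Scoring model
        dimension=100,
        # Training
        num_epochs=10,
        lr=0.01,
    )
    return config
```

Every key that is left out falls back to the default listed in the schema.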

All binaries also provide a -p flag that allows overriding some configuration keys for a single execution.

Schema¶

While these lists give an overview of all the options and a brief description of their meaning and use, each option is described in more depth in the other sections of the documentation.

I/O data¶

See I/O format for more details.

entity_path (type: string)

The path of the directory containing entity count files.
edge_paths (type: list of strings)

A list of paths to directories containing (partitioned) edgelists. Typically a single path is provided.
checkpoint_path (type: string)

The path to the directory where checkpoints (and thus the output) will be written to. If checkpoints are found in it, training will resume from them.
init_path (type: string or null; default: null)

If set, it must be a path to a directory that contains initial values for the embeddings of all the entities of some types.
checkpoint_preservation_interval (type: integer or null; default: null)

If set, every so many epochs a snapshot of the checkpoint will be archived. The snapshot will be located inside an epoch_N sub-directory of the checkpoint directory, and will contain symbolic links to the original checkpoint files, which will not be cleaned up as would normally happen.

Graph structure¶

See Data model for more details.

entities (type: dict of entities, see below for the entity schema)

The entity types. The ID with which they are referenced by the relation types is the key they have in this dict.

The sub-schema of the values of this dictionary is:
- num_partitions (type: integer)
  
  Number of partitions for this entity type. Set to 1 if unpartitioned. All partitioned entity types must have the same number of partitions.
- featurized (type: boolean; default: false)
  
  Whether the entities of this type are represented as sets of features.
  
  See Featurized entities for more details.
relations (type: list of relations, see below for the relation schema)

The relation types. The ID with which they will be referenced in the edge lists is their index in this list.

The sub-schema of the items of this list is:
- name (type: string)
  
  A human-readable identifier for the relation type. Not needed for training, only used for logging.
- lhs (type: string)
  
  The type of entities on the left-hand side of this relation, i.e., its key in the entities dict.
- rhs (type: string)
  
  The type of entities on the right-hand side of this relation, i.e., its key in the entities dict.
- weight (type: number; default: 1.0)
  
  The weight by which the loss induced by edges of this relation type will be multiplied.
  
  See Loss functions for more details.
- operator (type: string, either "none", "translation", "diagonal", "linear", "affine" or "complex_diagonal"; default: "none")
  
  The transformation to apply to the embedding of one of the sides of the edge (typically the right-hand one) before comparing it with the other one.
  
  See Operators for more details.
- all_negs (type: boolean; default: false)
  
  If enabled, the negatives for \((x, r, y)\) will consist of \((x, r, y')\) for all entities \(y'\) of the same type and in the same partition as \(y\), and, symmetrically, of \((x', r, y)\) for all entities \(x'\) of the same type and in the same partition as \(x\).
  
  See All negatives for more details.
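
The two sub-schemas above can be sketched as the following fragment. The entity type names, relation names and values are made up for illustration; only the keys come from the schema.

```python
# Entity types, keyed by the ID that relation types use to refer to them.
entities = {
    "user": {"num_partitions": 4},
    "item": {"num_partitions": 4, "featurized": False},
}

# Relation types. Their position in this list is the ID used in the edge lists.
relations = [
    {"name": "follows", "lhs": "user", "rhs": "user", "operator": "translation"},
    {
        "name": "bought",
        "lhs": "user",
        "rhs": "item",
        "operator": "diagonal",
        "weight": 2.0,
    },
]

# Map each relation name to the integer ID it has in the edge lists.
relation_ids = {rel["name"]: idx for idx, rel in enumerate(relations)}

# Sanity check: both sides of every relation must name a known entity type.
for rel in relations:
    if rel["lhs"] not in entities or rel["rhs"] not in entities:
        raise ValueError(f"relation {rel['name']!r} uses an unknown entity type")
```

Here edges of type "follows" would be stored with relation ID 0 and edges of type "bought" with relation ID 1.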

Scoring model¶

See From entity embeddings to edge scores for more details.

dimension (type: integer)

The dimension of the real space the embeddings live in.
init_scale (type: number; default: 0.001)

If no initial embeddings are provided, they are generated by sampling each dimension from a centered normal distribution having this standard deviation. (For performance reasons, sampling isn’t fully independent.)
max_norm (type: number or null; default: null)

If set, rescale the embeddings if their norm exceeds this value.
global_emb (type: boolean; default: true)

If enabled, add to each embedding a vector that is common to all the entities of a certain type. This vector is learned during training.
comparator (type: string, either "dot", "cos", "l2" or "squared_l2"; default: "cos")

How the embeddings of the two sides of an edge (after having already undergone some processing) are compared to each other to produce a score.
bias (type: boolean; default: false)

If enabled, withhold the first dimension of the embeddings from the comparator and instead use it as a bias, adding it back to the score. Makes sense for logistic and softmax loss functions.
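
To make init_scale and max_norm concrete, here is a rough pure-Python sketch of the behavior described above. It is not PBG's implementation: the function names are hypothetical, each coordinate is sampled independently (which PBG does not fully do, for performance reasons), and the norm is assumed to be the L2 norm.

```python
import math
import random


def init_embedding(dimension, init_scale=0.001, rng=None):
    """Sample each coordinate from a zero-mean normal with std init_scale."""
    rng = rng or random.Random()
    return [rng.gauss(0.0, init_scale) for _ in range(dimension)]


def clip_to_max_norm(embedding, max_norm):
    """Rescale the embedding if its L2 norm (an assumption) exceeds max_norm."""
    if max_norm is None:
        # max_norm unset: embeddings are left untouched.
        return list(embedding)
    norm = math.sqrt(sum(x * x for x in embedding))
    if norm <= max_norm:
        return list(embedding)
    scale = max_norm / norm
    return [x * scale for x in embedding]
```

With max_norm set to 1.0, the vector [3.0, 4.0] (norm 5) would be scaled down to norm 1, while [0.3, 0.4] would be left unchanged.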

Training¶

See Batch preparation for more details.

num_epochs (type: integer; default: 1)

The number of times the training loop iterates over all the edges.
num_edge_chunks (type: integer or null; default: null)

The number of equally-sized parts each bucket will be split into. Training will first proceed over all the first chunks of all buckets, then over all the second chunks, and so on. A higher value allows better mixing of partitions, at the cost of more time spent on I/O. If unset, will be automatically calculated so that no chunk has more than max_edges_per_chunk edges.
max_edges_per_chunk (type: integer; default: 1000000000)

The maximum number of edges that each edge chunk should contain if the number of edge chunks is left unspecified and has to be determined automatically. Each edge takes up at least 24 bytes (3 int64s), more if using featurized entities.
bucket_order (type: string, either "random", "affinity", "inside_out" or "outside_in"; default: "inside_out")

The order in which to iterate over the buckets.
workers (type: integer or null; default: null)

The number of worker processes for “Hogwild!” training. If not given, it defaults to the CPU count.
batch_size (type: integer; default: 1000)

The number of edges per batch.
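
The interplay between num_edge_chunks and max_edges_per_chunk described above can be sketched as follows. This is an illustration under assumptions, not PBG's code: the function names are hypothetical, and deriving the chunk count from the largest bucket is one reading of "no chunk has more than max_edges_per_chunk edges".

```python
import math


def auto_num_edge_chunks(edges_per_bucket, max_edges_per_chunk=1_000_000_000):
    """Smallest chunk count such that no chunk exceeds max_edges_per_chunk.

    edges_per_bucket maps a bucket, e.g. an (lhs_partition, rhs_partition)
    pair, to the number of edges it contains.
    """
    largest = max(edges_per_bucket.values())
    return max(1, math.ceil(largest / max_edges_per_chunk))


def chunk_bucket_order(buckets, num_edge_chunks):
    """First chunks of all buckets, then all second chunks, and so on."""
    return [
        (chunk, bucket)
        for chunk in range(num_edge_chunks)
        for bucket in buckets
    ]
```

For example, with buckets holding 2500 and 900 edges and a limit of 1000 edges per chunk, three chunks are needed, and every bucket is visited once per chunk index.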

See Negative sampling for more details.

num_batch_negs (type: integer; default: 50)

The number of negatives sampled from the batch, per positive edge.
num_uniform_negs (type: integer; default: 50)

The number of negatives uniformly sampled from the currently active partition, per positive edge.
disable_lhs_negs (type: boolean; default: false)

Disable negative sampling on the left-hand side.
disable_rhs_negs (type: boolean; default: false)

Disable negative sampling on the right-hand side.

See Loss functions for more details.

loss_fn (type: string, either "ranking", "logistic" or "softmax"; default: "ranking")

How the scores of positive edges and their corresponding negatives are evaluated.
margin (type: number or null; default: 0.1)

When using ranking loss, this value controls the minimum separation between positive and negative scores, below which a (linear) loss is incurred.
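
The role of margin can be sketched with a standard margin-based ranking (hinge) loss. This is a simplified illustration for a single positive edge: the function name is hypothetical, and summing over negatives is an assumption rather than a statement about how PBG aggregates the loss.

```python
def ranking_loss(pos_score, neg_scores, margin=0.1):
    """Hinge loss: each negative whose score is not at least `margin`
    below the positive score contributes linearly to the loss."""
    return sum(max(0.0, margin - pos_score + neg) for neg in neg_scores)
```

A negative scoring well below the positive contributes nothing, whereas one that falls inside the margin or scores above the positive is penalized in proportion to the violation.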

See Optimizers for more details.

lr (type: number; default: 0.01)

The learning rate for the optimizer.
relation_lr (type: number or null; default: null)

If set, the learning rate for the optimizer for relations. Otherwise, lr is used.

Evaluation during training¶

See Evaluation during training for more details.

eval_fraction (type: number; default: 0.05)

The fraction of edges withheld from training and used to track evaluation metrics during training.
eval_num_batch_negs (type: integer; default: 1000)

The value that overrides the number of negatives per positive edge sampled from the batch during the evaluation steps that occur before and after each training step.
eval_num_uniform_negs (type: integer; default: 1000)

The value that overrides the number of uniformly-sampled negatives per positive edge during the evaluation steps that occur before and after each training step.

Distributed training¶

See Distributed mode for more details.

num_machines (type: integer; default: 1)

The number of machines for distributed training.
num_partition_servers (type: integer; default: -1)

If -1, use the trainers as partition servers. If 0, don’t use partition servers (instead, swap partitions through disk). If >1, then that number of partition servers must be started manually.
distributed_init_method (type: string or null; default: null)

A URI defining how to synchronize all the workers of a distributed run. Must start with a scheme (e.g., file:// or tcp://) supported by PyTorch.
distributed_tree_init_order (type: boolean; default: true)

If enabled, distributed training can occur on a bucket only if at least one of its partitions has already been trained on in the same round (or if one of its partitions is 0, for bootstrapping).
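
As a sketch, the distributed keys could be layered on top of an existing single-machine config as shown below. The host name, port and helper function are hypothetical; the key names come from the list above.

```python
# Distributed-training keys; the host and port are placeholders.
distributed_keys = dict(
    num_machines=2,
    num_partition_servers=-1,  # use the trainers as partition servers
    distributed_init_method="tcp://trainer0.example.com:30050",
    distributed_tree_init_order=True,
)


def with_distributed(base_config):
    """Return a copy of base_config extended with the distributed keys."""
    init_method = distributed_keys["distributed_init_method"]
    if "://" not in init_method:
        raise ValueError("distributed_init_method must start with a scheme")
    return {**base_config, **distributed_keys}
```

The same config file would then be used on every machine taking part in the run.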

Dynamic relations¶

See Dynamic relations for more details.

dynamic_relations (type: boolean; default: false)

If enabled, activates the dynamic relation mode. In this mode there must be a single relation type in the config (whose parameters will apply to all dynamic relation types), and there must be a file called dynamic_rel_count.txt in the entity path that contains the number of dynamic relations. Batches will contain edges of multiple relation types and negatives will be sampled differently.
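
The two requirements of dynamic relation mode can be checked with a small helper like the one below. This is a hypothetical sketch, not part of PBG, and it assumes that dynamic_rel_count.txt holds a single integer in plain text.

```python
import os


def count_dynamic_relations(config):
    """Return the number of dynamic relations, or None if the mode is off."""
    if not config.get("dynamic_relations", False):
        return None
    if len(config["relations"]) != 1:
        raise ValueError("dynamic relation mode needs exactly one relation type")
    count_file = os.path.join(config["entity_path"], "dynamic_rel_count.txt")
    # Assumption: the file contains one integer, possibly followed by a newline.
    with open(count_file) as f:
        return int(f.read().strip())
```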

Misc¶

verbose (type: integer; default: 0)

The verbosity level of logging, currently 0 or 1.
hogwild_delay (type: number; default: 2.0)

The number of seconds by which to delay the start of all “Hogwild!” processes except the first one, on the first epoch.