From entity embeddings to edge scores

The goal of training is to embed each entity in \(\mathbb{R}^D\) so that the embeddings of two entities are a good proxy to predict whether there is a relation of a certain type between them.

To be more precise, the goal is to learn an embedding for each entity and a function for each relation type that takes two entity embeddings and assigns them a score, with the goal of having positive relations achieve higher scores than negative ones.

All the edges provided in the training set are considered positive instances. In order to perform training, a set of negative edges is needed as well. These are not provided by the user but are instead generated by the system during training (see Negative sampling), usually by fixing the left-hand side entity and the relation type and sampling a new right-hand side entity, or vice versa. This sampling scheme makes sense for large sparse graphs, where there is a low probability that an edge generated this way is a true positive edge of the graph.
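
As a minimal illustration of this corruption scheme (not the library's actual sampler; the names and shapes are assumptions), one could keep the left-hand side and relation fixed and resample the right-hand side uniformly:

    import torch

    def corrupt_rhs(pos_edges: torch.Tensor, num_entities: int, num_negs: int) -> torch.Tensor:
        """For each positive edge (lhs, rel, rhs), produce num_negs negatives by keeping
        lhs and rel fixed and sampling a new rhs uniformly at random. In a large sparse
        graph the chance of hitting a true edge this way is negligible."""
        lhs, rel = pos_edges[:, 0], pos_edges[:, 1]
        neg_rhs = torch.randint(num_entities, (pos_edges.size(0), num_negs))
        # Broadcast lhs/rel against the sampled rhs candidates: result is (N, num_negs, 3).
        return torch.stack(
            [lhs.unsqueeze(1).expand(-1, num_negs),
             rel.unsqueeze(1).expand(-1, num_negs),
             neg_rhs],
            dim=-1,
        )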

A priori, entity embeddings can take any value in \(\mathbb{R}^D\). In some cases, however (for example when restricting them to lie within a certain ball, or when comparing them using cosine distance), their “angle” matters more than their norm.

Per-relation scoring functions, however, must be expressible in a specific form (most of the common functions in the literature can be converted to such a representation). In the current implementation, they may only transform the embedding of one of the two sides, which is then compared to the untransformed embedding of the other side using a generic symmetric comparator function that is shared by all relations. Formally, for left- and right-hand side entities \(x\) and \(y\) respectively, and for a relation type \(r\), the score is:

\[f_r(\theta_x, \theta_y) = c(\theta_x, g_r(\theta_y))\]

where \(\theta_x\) and \(\theta_y\) are the embeddings of \(x\) and \(y\) respectively, \(f_r\) is the scoring function for \(r\), \(g_r\) is the operator for \(r\) and \(c\) is the comparator.
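
As a concrete sketch of this factorization, here is the score computation with a translation operator as \(g_r\) and a dot-product comparator as \(c\) (illustrative code, not the library's implementation):

    import torch

    def score(theta_x: torch.Tensor, theta_y: torch.Tensor,
              rel_translation: torch.Tensor) -> torch.Tensor:
        """f_r(theta_x, theta_y) = c(theta_x, g_r(theta_y)) with a translation
        operator g_r and a dot-product comparator c."""
        transformed_y = theta_y + rel_translation   # g_r: per-relation translation
        return (theta_x * transformed_y).sum(-1)    # c: dot product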

Under “normal” circumstances (the so-called “standard” relations mode) the operator is applied only to the right-hand side entities; this is not the case when using dynamic relations. Applying the operator to both sides would often be redundant, and applying it to only one side breaks the symmetry and captures the direction of the edge.

Embeddings

Embeddings live in a \(D\)-dimensional real space, where \(D\) is determined by the dimension configuration parameter.

Normally, each entity has its own embedding, which is entirely independent of any other entity’s embedding. When using featurized entities, however, this works differently: an entity’s embedding is the average of the embeddings of its features.
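
A minimal sketch of that averaging (the names are illustrative, not the library's API):

    import torch

    def featurized_entity_embedding(feature_ids: torch.Tensor,
                                    feature_embs: torch.Tensor) -> torch.Tensor:
        """The entity's embedding is the mean of its features' embeddings."""
        return feature_embs[feature_ids].mean(dim=0)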

If the max_norm configuration parameter is set, embeddings will be projected onto the ball of radius max_norm after each parameter update.
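
In plain PyTorch, such a projection could look like this (an illustrative sketch, not the library's code):

    import torch

    def project_to_ball(embs: torch.Tensor, max_norm: float) -> torch.Tensor:
        """Rescale any embedding whose L2 norm exceeds max_norm back onto the
        ball of radius max_norm; shorter embeddings are left untouched."""
        return embs.renorm(p=2, dim=0, maxnorm=max_norm)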

To add a new type of embedding, one needs to subclass the torchbiggraph.model.AbstractEmbedding class.

Global embeddings

When the global_emb configuration option is active, each entity’s embedding will be translated by a vector that is specific to each entity type (and that is learned at the same time as the embeddings).
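
As a rough illustration (shapes and names are placeholders), the embedding that enters the score computation is the sum of the entity’s own embedding and its type’s global vector:

    import torch

    D = 400                                                # embedding dimension (placeholder)
    entity_embs = torch.randn(1000, D)                     # one row per entity of this type
    type_global_emb = torch.zeros(D, requires_grad=True)   # one learned vector per entity type
    effective_embs = entity_embs + type_global_emb         # what the scoring function actually sees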

Operators

The operators that are currently provided are:

  • none, a no-op, which leaves the embeddings unchanged;
  • translation, which adds to the embedding a vector of the same dimension;
  • diagonal, which multiplies each dimension by a different coefficient (equivalent to multiplying by a diagonal matrix);
  • linear, which applies a linear map, i.e., multiplies by a full square matrix;
  • affine, which applies an affine transformation, i.e., a linear map followed by a translation;
  • complex_diagonal, which interprets the \(D\)-dimensional real vector as a \(D/2\)-dimensional complex vector (\(D\) must be even; the first half of the vector holds the real parts, the second half the imaginary parts) and then multiplies each entry by a different complex parameter, just like diagonal does with real parameters.

All the operators’ parameters are learned during training.
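
As an illustrative plain-PyTorch sketch (not the library’s implementation), here is how the diagonal and complex_diagonal operators act on an embedding:

    import torch

    def diagonal_op(emb: torch.Tensor, diag: torch.Tensor) -> torch.Tensor:
        """Multiply each dimension by its own learned coefficient."""
        return emb * diag

    def complex_diagonal_op(emb: torch.Tensor, real: torch.Tensor,
                            imag: torch.Tensor) -> torch.Tensor:
        """Treat a D-dimensional real vector as a D/2-dimensional complex vector
        (first half = real parts, second half = imaginary parts) and multiply it
        entry-wise by a learned complex parameter (real, imag)."""
        d = emb.size(-1) // 2
        a, b = emb[..., :d], emb[..., d:]  # a + ib
        # (a + ib) * (real + i*imag) = (a*real - b*imag) + i*(a*imag + b*real)
        return torch.cat([a * real - b * imag, a * imag + b * real], dim=-1)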

To define an additional operator, one must subclass the torchbiggraph.model.AbstractOperator class (or torchbiggraph.model.AbstractDynamicOperator when using dynamic relations; their docstrings explain what must be implemented) and decorate it with the torchbiggraph.model.OPERATORS.register_as() decorator (respectively torchbiggraph.model.DYNAMIC_OPERATORS.register_as()), specifying a new name that can then be used in the config to select that operator. All of the above can be done inside the config file itself.
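
A hedged sketch of such a custom operator follows; the constructor and forward signatures shown here are assumptions and should be checked against the AbstractOperator docstring:

    import torch
    from torch import nn
    from torchbiggraph.model import AbstractOperator, OPERATORS

    @OPERATORS.register_as("scaling")  # "scaling" then becomes usable as an operator name in the config
    class ScalingOperator(AbstractOperator):
        """Illustrative operator that multiplies the embedding by one learned scalar.
        The __init__ signature (a single dim argument) is an assumption; consult
        the AbstractOperator docstring for the exact required interface."""

        def __init__(self, dim: int):
            super().__init__(dim)
            self.scale = nn.Parameter(torch.ones(()))

        def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
            return embeddings * self.scale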

Comparators

The available comparators are:

  • dot, the dot product, i.e., the scalar or inner product of the two embedding vectors;
  • cos, the cosine “distance”, i.e., the cosine of the angle between the two vectors or, equivalently, the dot product divided by the product of the vectors’ norms;
  • l2, the negative L2 distance, a.k.a. the Euclidean distance (negated because smaller distances should get higher scores);
  • squared_l2, the negative squared L2 distance.
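
As an illustrative plain-PyTorch sketch (not the library’s implementation), these comparators correspond to the following formulas:

    import torch
    import torch.nn.functional as F

    def dot_comparator(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        return (x * y).sum(-1)

    def cos_comparator(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        return F.cosine_similarity(x, y, dim=-1)

    def l2_comparator(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # Negated so that closer pairs score higher.
        return -(x - y).norm(p=2, dim=-1)

    def squared_l2_comparator(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        return -((x - y) ** 2).sum(-1)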

Custom comparators need to extend the torchbiggraph.model.AbstractComparator class (its docstring explains how) and decorate it with the torchbiggraph.model.COMPARATORS.register_as() decorator, specifying a new name that can then be used in the config to select that comparator. All of the above can be done inside the config file itself.

Bias

If the bias configuration key is enabled, the first coordinate of each embedding acts as a bias in the comparator computation. That is, the comparator is computed on the last \(D - 1\) entries of the two vectors only, and then the first entries of both vectors are added to the result.
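
For instance, with the dot comparator the biased score would be computed roughly as follows (illustrative code):

    import torch

    def biased_dot_score(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        """Comparator on the last D-1 coordinates, plus the first coordinate
        of each vector acting as a bias term."""
        return (x[..., 1:] * y[..., 1:]).sum(-1) + x[..., 0] + y[..., 0]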

Coherent sets of configuration parameters

While the parameters described in this chapter are exposed as uncoupled knobs in the configuration file (to more closely match the implementation, and to allow for more flexible tuning), some combinations of them are more sensible than others.

Apart from the default one, the following configuration has been found to work well: init_scale = 0.1, comparator = dot, bias = true, loss_fn = logistic, lr = 0.1.
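
For illustration, those settings would appear in a config file roughly as below; the entity, relation, and path entries are placeholders to be adapted to one’s own setup:

    def get_torchbiggraph_config():
        return dict(
            # Placeholder graph layout; adapt the paths, entities and relations to your data.
            entity_path="data/example",
            edge_paths=["data/example/train_edges"],
            checkpoint_path="model/example",
            entities={"all": {"num_partitions": 1}},
            relations=[
                {"name": "all_edges", "lhs": "all", "rhs": "all", "operator": "translation"}
            ],
            dimension=400,
            # The combination reported to work well in this chapter:
            init_scale=0.1,
            comparator="dot",
            bias=True,
            loss_fn="logistic",
            lr=0.1,
        )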

Interpreting the scores

Depending on the loss function used during training, the scores take on different meanings and become suitable for different applications, such as ranking which other entities may be related to a given entity, or estimating the probability that a certain relation exists between two given entities.
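
A small sketch of the difference (illustrative code): under the logistic loss the scores behave like logits, whereas under ranking losses only their relative order among candidates is meaningful.

    import torch

    scores = torch.tensor([2.3, -0.7, 0.1])     # hypothetical scores for three candidate edges
    probabilities = torch.sigmoid(scores)       # interpretable as P(edge) under the logistic loss
    ranking = scores.argsort(descending=True)   # relative order is what matters under ranking losses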

Todo

Talk about what you can do with the trained embeddings (e.g., compute P(edge), find k-nearest neighbors, or train downstream classifiers on the features). It would also be nice to have a small script that takes an edge as input and outputs its score, or some nearest-neighbor tool.