Featurized entities


This is an advanced feature, which is still under development and hasn’t fully stabilized yet.

In normal operation PBG considers each entity atomic and distinct from all others, and as such it learns an independent embedding for each of them, with no correlation other than the one acquired during training. However, it is common practice to represent some types of data as collections of underlying “features”, each of which has its own learned embedding. The embedding of an entity is then implicitly derived from the embeddings of its features, so two entities that share a feature will have correlated embeddings.

For example, entities that represent text documents could have their words as features, i.e., an embedding is learned for each word and the embedding of a document is the average of the embeddings of the words it contains.
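To make the averaging concrete, here is a minimal sketch in plain Python. The word vectors, the document contents and the helper name `document_embedding` are all made up for illustration; this is not PBG code, only the arithmetic the paragraph above describes.

```python
from typing import Dict, List


def document_embedding(
    words: List[str], word_embeddings: Dict[str, List[float]]
) -> List[float]:
    """Average the embeddings of the given words, coordinate by coordinate."""
    vectors = [word_embeddings[w] for w in words]
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]


# Hypothetical 2-dimensional embeddings, one per word (i.e., per feature).
word_embeddings = {
    "graph": [1.0, 0.0],
    "embedding": [0.0, 1.0],
    "training": [1.0, 1.0],
}

# Both documents contain "graph", so their embeddings share that contribution.
doc_a = document_embedding(["graph", "embedding"], word_embeddings)
doc_b = document_embedding(["graph", "training"], word_embeddings)
```

Note that this sketch does not handle a document with no words; an empty feature list would need a separate convention.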

PBG provides this capability. Featurized mode is activated on a per-entity type basis by enabling the featurized flag in that entity type's config. As this feature isn’t finalized yet, the tooling around it isn’t on par with the tooling for non-featurized entities, in particular for converting featurized edgelists to the PBG format. Practitioners will have to implement their own converters, based on the format described below. Contributions of converters to and from standard formats are welcome.
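As a rough illustration, the relevant part of a config might look like the fragment below. The entity type names "document" and "author" are hypothetical, and the fragment is an incomplete sketch rather than a full working config. Setting num_uniform_negs to 0 reflects the restriction on negative sampling described at the end of this section, under the assumption that this is the key controlling uniformly-sampled negatives.

```python
# Hypothetical entity types; only "document" is treated as a bag of features.
entities = {
    "document": {"num_partitions": 1, "featurized": True},
    "author": {"num_partitions": 1},  # featurized defaults to off
}

# Incomplete config fragment: a real config needs many more keys.
config_fragment = {
    "entities": entities,
    # Assumption: uniform negatives are disabled globally by setting this to 0.
    "num_uniform_negs": 0,
}

# Entity types for which featurized mode has been switched on.
featurized_types = [
    name for name, schema in entities.items() if schema.get("featurized", False)
]
```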

The following changes occur in the training process when featurized entities are enabled:

  • The count stored in the entity_count_type_part.txt file refers to the total number of different features that are encountered in the edge files, rather than to the number of different sets of features.

  • Each edge file edges_lhs_rhs.h5 must contain a few more datasets. If any edge in it has a featurized entity on the left-hand side then it must contain two one-dimensional datasets of integers: lhsd_data, which contains the flattened concatenation of the lists of features of all left-hand side entities of the edges in the file, and lhsd_offsets, which contains the “cutpoints” of lhsd_data where the feature list of one entity ends and the one for the next entity starts.

    Thus the entries of lhsd_data are feature identifiers, while the entries of lhsd_offsets are indices into lhsd_data. Each pair of consecutive entries of lhsd_offsets represents a half-open interval of lhsd_data. Consequently, the first entry of lhsd_offsets should be 0, the last entry should be the size of lhsd_data, and entries should be in non-decreasing order. If the edge file contains \(N\) edges, then lhsd_offsets must contain \(N + 1\) entries.

    • If the left-hand side entity of edge \(i\) is featurized, then its features will be the values of lhsd_data between positions lhsd_offsets\([i]\) (inclusive) and lhsd_offsets\([i+1]\) (exclusive). The \(i\)-th entry of the lhs dataset, on the other hand, can be any value, as it will be ignored.
    • If the left-hand side entity of edge \(i\) is not featurized, then the offset of the entity will be in lhs\([i]\), just as usual. In that case its set of features should be empty, that is, one should have lhsd_offsets\([i]\) equal to lhsd_offsets\([i+1]\).

    If any right-hand side entity is featurized, the same must hold for datasets rhsd_offsets and rhsd_data.

  • Entities are represented as “bags of features”. That is, their embeddings will be the average of the embeddings of their features, similarly to how text documents can be represented as the average of the embeddings of the words they contain.

  • The only form of negative sampling supported for featurized entities is the same-batch mode. Neither the all-negatives mode nor the uniformly-sampled negatives mode is supported. Observe that this means that uniform sampling of negatives must be disabled globally.
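The layout of lhsd_data and lhsd_offsets described in the list above can be sketched in plain Python as follows. The helper names (`flatten_features`, `check_offsets`, `features_of_edge`) and the feature identifiers are invented for illustration and are not part of PBG. A real converter would additionally have to write the resulting lists as one-dimensional integer datasets into the edge file, for instance with a library such as h5py.

```python
from itertools import accumulate
from typing import List, Tuple


def flatten_features(per_edge_features: List[List[int]]) -> Tuple[List[int], List[int]]:
    """Concatenate per-edge feature lists and record where each one starts."""
    data = [f for feats in per_edge_features for f in feats]
    # Running total of list lengths, with a leading 0: N edges give N + 1 offsets.
    offsets = [0] + list(accumulate(len(feats) for feats in per_edge_features))
    return data, offsets


def check_offsets(data: List[int], offsets: List[int], num_edges: int) -> None:
    """Verify the invariants stated in the text; raises AssertionError otherwise."""
    assert len(offsets) == num_edges + 1
    assert offsets[0] == 0
    assert offsets[-1] == len(data)
    assert all(a <= b for a, b in zip(offsets, offsets[1:]))


def features_of_edge(data: List[int], offsets: List[int], i: int) -> List[int]:
    """Features of edge i: the half-open interval [offsets[i], offsets[i + 1])."""
    return data[offsets[i]:offsets[i + 1]]


# Three edges. Edges 0 and 2 have a featurized left-hand side entity;
# edge 1 does not, so its feature list is empty.
per_edge = [[4, 7, 7], [], [2]]
lhsd_data, lhsd_offsets = flatten_features(per_edge)
check_offsets(lhsd_data, lhsd_offsets, num_edges=len(per_edge))
```

With this input, lhsd_data is [4, 7, 7, 2] and lhsd_offsets is [0, 3, 3, 4]. The repeated offset 3 encodes the empty feature list of the non-featurized edge.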