Featurized entities
Caution
This is an advanced feature, which is still under development and hasn’t fully stabilized yet.
In normal operation PBG considers each entity atomic and distinct from all others, and as such it learns an independent embedding for each of them, with no correlation other than the one acquired during training. However, it is common practice to represent some types of data as collections of underlying “features”, each of which has its own learned embedding. The embedding of an entity is then implicitly derived from the embeddings of its features, so two entities that share a feature will have correlated embeddings.
For example, entities that represent text documents could have their words as features, i.e., an embedding is learned for each word and the embedding of a document is the average of the embeddings of the words it contains.
PBG provides this capability. Featurized mode is activated on a per-entity type basis by enabling the featurized flag on its config. As this feature isn’t finalized yet, the tooling around it isn’t up to par with non-featurized entities, in particular for converting featurized edgelists to the PBG format. Practitioners will have to implement their own converters, based on the format described below. Contributions of converters to and from standard formats are welcome.
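As a minimal sketch, the entity section of a config with one featurized type might look like the following. The entity type names (document, user) are hypothetical, and the num_partitions field is an assumption about the entity schema; only the featurized flag is taken from the text above.

```python
# Hypothetical entity types; only the "featurized" flag comes from the text.
entities = {
    # Documents are described by the words they contain, so they are featurized.
    "document": {"num_partitions": 1, "featurized": True},
    # Users stay atomic: one independent embedding per entity.
    "user": {"num_partitions": 1, "featurized": False},
}

# List the entity types for which featurized mode is enabled.
featurized_types = sorted(
    name for name, schema in entities.items() if schema["featurized"]
)
print(featurized_types)
```

The flag is set per entity type, so featurized and non-featurized types can coexist in the same config.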
The following changes occur in the training process when featurized entities are enabled:
The count stored in the entity_count_type_part.txt file refers to the total number of different features that are encountered in the edge files, rather than to the number of different sets of features.

Each edge file edges_lhs_rhs.h5 must contain a few more datasets. If any edge in it has a featurized entity on the left-hand side, then it must contain two one-dimensional datasets of integers: lhsd_data, which contains the flattened concatenation of the lists of features of all left-hand side entities of the edges in the file, and lhsd_offsets, which contains the “cutpoints” of lhsd_data where the feature list of one entity ends and the one for the next entity starts.

Thus the entries of lhsd_data are feature identifiers, while the entries of lhsd_offsets are indices of lhsd_data. Each pair of consecutive entries of lhsd_offsets represents a half-open interval of lhsd_data, thus the first entry of lhsd_offsets should be 0, the last entry should be the size of lhsd_data, and entries should be in non-decreasing order. If the edge file contains \(N\) edges, then lhsd_offsets must contain \(N + 1\) entries.

- If the left-hand side entity of edge \(i\) is featurized, then its features will be the values of lhsd_data between positions lhsd_offsets\([i]\) (inclusive) and lhsd_offsets\([i+1]\) (exclusive). The \(i\)-th entry of the lhs dataset, on the other hand, can be any value, as it will be ignored.
- If the left-hand side entity of edge \(i\) is not featurized, then the offset of the entity will be in lhs\([i]\), just as usual. In that case its set of features should be empty, that is, one should have lhsd_offsets\([i]\) equal to lhsd_offsets\([i+1]\).

If any right-hand side entity is featurized, the same must hold for datasets rhsd_offsets and rhsd_data.
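The data/offsets layout described above can be sketched in a few lines of plain Python. This is an illustration of the format only, not PBG's own converter: the helper names (flatten_features, check_offsets, features_of_edge) and the feature identifiers are hypothetical, and writing the resulting arrays into the HDF5 edge file is left out.

```python
from itertools import accumulate


def flatten_features(feature_lists):
    """Flatten per-edge feature lists into (data, offsets)."""
    data = [feature for features in feature_lists for feature in features]
    # offsets[i] is where the feature list of edge i starts in data.
    offsets = [0] + list(accumulate(len(features) for features in feature_lists))
    return data, offsets


def check_offsets(data, offsets, num_edges):
    """Check the invariants of the format; raises AssertionError on violation."""
    assert len(offsets) == num_edges + 1
    assert offsets[0] == 0
    assert offsets[-1] == len(data)
    assert all(a <= b for a, b in zip(offsets, offsets[1:]))


def features_of_edge(data, offsets, i):
    """Features of edge i: the half-open slice [offsets[i], offsets[i + 1])."""
    return data[offsets[i]:offsets[i + 1]]


# Three edges: edges 0 and 2 have a featurized left-hand side entity,
# edge 1 has a non-featurized one and therefore an empty feature list.
lhs_feature_lists = [[4, 7], [], [7, 2, 9]]
lhsd_data, lhsd_offsets = flatten_features(lhs_feature_lists)
check_offsets(lhsd_data, lhsd_offsets, num_edges=len(lhs_feature_lists))

print(lhsd_data)     # [4, 7, 7, 2, 9]
print(lhsd_offsets)  # [0, 2, 2, 5]
```

Note how the non-featurized edge 1 shows up as two equal consecutive offsets, which is exactly the empty interval the format asks for.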
Entities are represented as “bags of features”. That is, their embeddings will be the average of the embeddings of their features, similarly to how text documents can be represented as the average of the embeddings of the words they contain.
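To make the averaging concrete, here is a small sketch in plain Python. The function name and the toy two-dimensional embeddings are made up for illustration; PBG's actual implementation is not shown here. The sketch rejects empty feature lists simply because their average is undefined.

```python
def bag_of_features_embedding(feature_ids, feature_embeddings):
    """Average the embeddings of the given features, dimension by dimension."""
    if not feature_ids:
        raise ValueError("cannot average an empty list of features")
    dim = len(feature_embeddings[feature_ids[0]])
    total = [0.0] * dim
    for feature_id in feature_ids:
        for k, value in enumerate(feature_embeddings[feature_id]):
            total[k] += value
    return [value / len(feature_ids) for value in total]


# Toy table: one learned embedding per feature (e.g. per word).
feature_embeddings = [
    [1.0, 0.0],  # feature 0
    [0.0, 1.0],  # feature 1
    [1.0, 1.0],  # feature 2
]

doc_a = bag_of_features_embedding([0, 2], feature_embeddings)
doc_b = bag_of_features_embedding([1, 2], feature_embeddings)
print(doc_a)  # [1.0, 0.5]
print(doc_b)  # [0.5, 1.0]
```

Both documents contain feature 2, so its embedding contributes to both averages; this is the correlation that sharing a feature introduces.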
The only form of negative sampling supported for featurized entities is the same-batch mode. Neither the all-negatives mode nor the uniformly-sampled negatives mode is supported. Observe that this means that uniform sampling of negatives must be disabled globally.
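A pre-flight check for this restriction could be sketched as follows. The helper check_negative_sampling is hypothetical, and the field names num_uniform_negs, num_batch_negs, all_negs, relations, lhs and rhs are assumptions about the config schema that the text above does not spell out; adjust them to the config you actually use. The config is assumed to be a fully specified plain dict.

```python
def check_negative_sampling(config):
    """Return a list of problems; an empty list means the settings look compatible."""
    entities = config["entities"]
    problems = []
    if not any(schema["featurized"] for schema in entities.values()):
        return problems  # No featurized entity types: nothing to check.
    # Uniform sampling is a global setting, so it must be switched off entirely.
    if config["num_uniform_negs"] != 0:
        problems.append("num_uniform_negs must be 0 when an entity type is featurized")
    # All-negatives mode is rejected for relations touching a featurized type.
    for relation in config["relations"]:
        sides = (relation["lhs"], relation["rhs"])
        if relation["all_negs"] and any(entities[s]["featurized"] for s in sides):
            problems.append(f"relation {relation['name']!r} must not use all_negs")
    return problems


# Hypothetical config with one featurized entity type.
config = {
    "entities": {
        "document": {"num_partitions": 1, "featurized": True},
        "user": {"num_partitions": 1, "featurized": False},
    },
    "relations": [
        {"name": "wrote", "lhs": "user", "rhs": "document", "all_negs": False},
    ],
    "num_batch_negs": 50,   # same-batch negatives stay enabled
    "num_uniform_negs": 0,  # uniform negatives disabled globally
}

problems = check_negative_sampling(config)
bad_problems = check_negative_sampling(dict(config, num_uniform_negs=50))
print(problems)
print(bad_problems)
```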