# Data model¶

PBG operates on directed multi-relation multigraphs, whose vertices are called **entities**.
Each **edge** connects a source to a destination entity, which are respectively called its
**left-** and **right-hand side** (shortened to **LHS** and **RHS**). Multiple edges between
the same pair of entities are allowed. Loops, i.e., edges whose left- and right- hand sides
are the same, are allowed as well.

Each entity is of a certain **entity type** (one and only one type per entity).
Thus, the types partition all the entities into disjoint groups. Similarly, each
edge also belongs to exactly one **relation type**. All edges of a given
relation type must have all their left-hand side entities of the same entity
type and, similarly, all their right-hand side entities of the same entity type
(possibly a different entity type than the left-hand side one). This property
means that each relation type has a left-hand side entity type and a right-hand
side entity type.

In order for PBG to operate on large-scale graphs, the graph is broken
up into small pieces, on which training can happen in a distributed manner. This
is first achieved by further splitting the entities of each type into a certain
number of subsets, called **partitions**. Then, for each relation type, its
edges are divided into **buckets**: for each pair of partitions (one from the
left- and one from the right-hand side entity types for that relation type)
a bucket is created, which contains the edges of that type whose left- and
right-hand side entities are in those partitions.

Note

For technical reasons, at the current state all entity types that appear on the left-hand side of some relation type must be divided into the same number of partitions (except unpartitioned entities). The same must hold for all entity types that appear on the right-hand side. In numpy-speak, it means that the number of partitions of all entities must be broadcastable to the same value.

An entity is identified by its type, its partition and its index within the partition (indices must be contiguous, meaning that if there are \(N\) entities in a type’s partition, their indices lie in the half-open interval \([0, N)\)). An edge is identified by its type, its bucket (i.e., the partitions of its left- and right-hand side entity types) and the indices of its left- and right-hand side entities in their respective partitions. An edge doesn’t have to specify its left- and right-hand side entity types, because they are implicit in the edge’s relation type.

Formally, each bucket can be identifies by a pair of integers \((i, j)\), where \(i\) and \(j\) are respectively the left- and right-hand side partitions. Inside that bucket, each edge can be identified by a triplet of integers \((x, r, y)\), with \(x\) and \(y\) representing respectively the left- and right-hand side entities and \(r\) representing the relation type. This edge is “interpreted” by first looking up relation type \(r\) in the configuration, and finding out that it can only have entities of type \(e_1\) on its left-hand side and of type \(e_2\) on its right-hand side. One can then determine the left-hand side entity, which is given by \((e_1, i, x)\) (its type, its partition and its index within the partition), and, similarly, the right-hand side one which is \((e_2, j, y)\).