Pre-trained embeddings

For demonstration purposes, and to save users time, we provide pre-trained embeddings for some common public datasets.

Wikidata

Wikidata is a well-known knowledge base, which includes the data of the discontinued Freebase knowledge base.

We used the so-called “truthy” dump from 2019-03-06, in the RDF N-Triples format. (The original file is no longer available on the Wikidata website.) We used as entities all the distinct strings that appeared as either source or target nodes in this dump: this means that entities include URLs of Wikidata entities (in the form <http://www.wikidata.org/entity/Q123>), plain quoted strings (e.g., "Foo"), strings with language annotations (e.g., "Bar"@fr), dates and times, and possibly other literals. Similarly, we used as relation types all the distinct strings that appeared as properties. We then filtered out the entities and relation types that appeared fewer than 5 times in the data dump.
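
As a rough illustration of this filtering step, here is a minimal sketch, not the script we actually used: the file name is hypothetical, the 5-occurrence threshold comes from the paragraph above, and a real pass would want a proper N-Triples parser rather than naive line splitting.

from collections import Counter

entity_counts = Counter()
relation_counts = Counter()

with open("wikidata-truthy-2019-03-06.nt", encoding="utf-8") as f:  # hypothetical local path
    for line in f:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Subjects and predicates are IRIs without spaces, so splitting twice
        # leaves the (possibly space-containing) object term in the last field.
        subject, predicate, obj = line.split(" ", 2)
        if obj.endswith("."):
            obj = obj[:-1].rstrip()  # drop the trailing N-Triples terminator
        entity_counts[subject] += 1
        entity_counts[obj] += 1
        relation_counts[predicate] += 1

# Keep only the entities and relation types seen at least 5 times.
entities = {e for e, c in entity_counts.items() if c >= 5}
relation_types = {r for r, c in relation_counts.items() if c >= 5}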

The embeddings were trained with the following configuration:

def get_torchbiggraph_config():

    config = dict(
        # I/O data
        entity_path='data/wikidata',
        edge_paths=[],
        checkpoint_path='model/wikidata',

        # Graph structure
        entities={
            'all': {'num_partitions': 1},
        },
        relations=[{
            'name': 'all_edges',
            'lhs': 'all',
            'rhs': 'all',
            'operator': 'translation',
        }],
        dynamic_relations=True,  # relation types are read from the data; all share the entry above

        # Scoring model
        dimension=200,
        global_emb=False,
        comparator='dot',

        # Training
        num_epochs=4,
        num_edge_chunks=10,
        batch_size=10000,
        num_batch_negs=500,    # negatives sampled from within each batch, per positive edge
        num_uniform_negs=500,  # negatives sampled uniformly at random, per positive edge
        loss_fn='softmax',
        lr=0.1,
        relation_lr=0.01,  # separate learning rate for the relation operator parameters

        # Evaluation during training
        eval_fraction=0.001,
        eval_num_batch_negs=10000,
        eval_num_uniform_negs=0,

        # Misc
        verbose=1,
    )

    return config
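
For intuition about the scoring settings above: the 'translation' operator offsets an embedding by a learned per-relation vector, the 'dot' comparator compares the two sides with an inner product, and the 'softmax' loss ranks each positive edge against its sampled negatives. The snippet below is only a sketch of that idea in plain NumPy, not PyTorch-BigGraph's actual implementation (in particular, which side the operator is applied to and the batched negative sampling work differently).

import numpy as np

def edge_score(lhs_emb, rel_translation, rhs_emb):
    # 'translation' operator + 'dot' comparator: offset one side's embedding
    # by the relation vector, then take the dot product with the other side.
    return np.dot(lhs_emb + rel_translation, rhs_emb)

def softmax_loss(pos_score, neg_scores):
    # 'softmax' loss: cross-entropy of the positive edge against its negatives.
    scores = np.concatenate(([pos_score], neg_scores))
    scores -= scores.max()  # for numerical stability
    return float(np.log(np.exp(scores).sum()) - scores[0])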

The output embeddings are available in various formats: