Pre-trained embeddings
======================

For demonstration purposes, and to save users time, we provide pre-trained embeddings for some common public datasets.

.. _wiki-data:

Wikidata
--------

`Wikidata <https://www.wikidata.org/>`_ is a well-known knowledge base, which includes the discontinued Freebase knowledge base. We used the so-called "truthy" dump from 2019-03-06, in the RDF NTriples format. (The original file is no longer available on the Wikidata website.) We used as entities all the distinct strings that appeared as either source or target nodes in this dump: this means that entities include URLs of Wikidata entities (in the form :samp:`<http://www.wikidata.org/entity/{Q123}>`), plain quoted strings (e.g., :samp:`"{Foo}"`), strings with language annotation (e.g., :samp:`"{Bar}"@{fr}`), dates and times, and possibly more. Similarly, we used as relation types all the distinct strings that appeared as properties. We then filtered out entities and relation types that appeared fewer than 5 times in the data dump (a sketch of this counting is given at the end of this section).

The embeddings were trained with the following configuration::

    def get_torchbiggraph_config():

        config = dict(
            # I/O data
            entity_path='data/wikidata',
            edge_paths=[],
            checkpoint_path='model/wikidata',

            # Graph structure
            entities={
                'all': {'num_partitions': 1},
            },
            relations=[{
                'name': 'all_edges',
                'lhs': 'all',
                'rhs': 'all',
                'operator': 'translation',
            }],
            dynamic_relations=True,

            # Scoring model
            dimension=200,
            global_emb=False,
            comparator='dot',

            # Training
            num_epochs=4,
            num_edge_chunks=10,
            batch_size=10000,
            num_batch_negs=500,
            num_uniform_negs=500,
            loss_fn='softmax',
            lr=0.1,
            relation_lr=0.01,

            # Evaluation during training
            eval_fraction=0.001,
            eval_num_batch_negs=10000,
            eval_num_uniform_negs=0,

            # Misc
            verbose=1,
        )

        return config

The output embeddings are available in various formats:

- `wikidata_translation_v1.tsv.gz <https://dl.fbaipublicfiles.com/torchbiggraph/wikidata_translation_v1.tsv.gz>`_ (36GiB), a gzipped TSV (tab-separated values) file in an old format produced by ``torchbiggraph_export_to_tsv`` (see :ref:`here ` for how to parse it; a rough parsing sketch is also given below).
- `wikidata_translation_v1_names.json.gz <https://dl.fbaipublicfiles.com/torchbiggraph/wikidata_translation_v1_names.json.gz>`_ (378MiB), a gzipped JSON-encoded list of all the keys in the first column of the TSV file.
- `wikidata_translation_v1_vectors.npy.gz <https://dl.fbaipublicfiles.com/torchbiggraph/wikidata_translation_v1_vectors.npy.gz>`_ (39.9GiB), a gzipped serialized NumPy array with the 200-dimensional vectors, one for each line of the TSV file.
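To make the TSV format concrete, here is a rough parsing sketch. It is not the authoritative parser (see the reference linked above for the actual format): it assumes that each entity line consists of a key followed by its 200 tab-separated vector components, and it skips lines with a different shape (such as the relation-parameter lines)::

    import gzip

    import numpy as np

    def iter_tsv_embeddings(path, dim=200):
        """Yield (key, vector) pairs from the gzipped TSV dump.

        Assumption: entity lines are a key followed by ``dim`` floats,
        all tab-separated. Lines with a different field count (e.g.,
        relation parameters) are skipped.
        """
        with gzip.open(path, "rt", encoding="utf-8") as f:
            for line in f:
                fields = line.rstrip("\n").split("\t")
                if len(fields) != dim + 1:
                    continue
                yield fields[0], np.asarray(fields[1:], dtype=np.float32)

    # Example: stream the first few embeddings without loading the 36GiB file.
    for i, (key, vector) in enumerate(iter_tsv_embeddings("wikidata_translation_v1.tsv.gz")):
        print(key, vector[:3])
        if i == 2:
            break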
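The names and vectors files are meant to be used together: row *i* of the NumPy array is the embedding of the *i*-th key in the JSON list. A minimal loading sketch, assuming the files have been downloaded to the working directory (note that the key list and the name-to-row index are held entirely in memory, which at this scale requires a machine with plenty of RAM)::

    import gzip
    import json
    import shutil

    import numpy as np

    # The key list: one entry per row of the vector array.
    with gzip.open("wikidata_translation_v1_names.json.gz", "rt", encoding="utf-8") as f:
        names = json.load(f)

    # np.load expects a seekable on-disk file, so decompress the array first.
    with gzip.open("wikidata_translation_v1_vectors.npy.gz", "rb") as f_in:
        with open("wikidata_translation_v1_vectors.npy", "wb") as f_out:
            shutil.copyfileobj(f_in, f_out)
    vectors = np.load("wikidata_translation_v1_vectors.npy", mmap_mode="r")

    assert len(names) == vectors.shape[0]

    # Map each key to its row in order to look up individual embeddings.
    row_of = {name: i for i, name in enumerate(names)}
    embedding = vectors[row_of[names[0]]]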
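Finally, for readers who want to reproduce the entity and relation-type extraction described at the top of this section, here is a rough sketch of the counting and frequency filtering. It is an illustration only, not the preprocessing code that was actually used: it assumes a simple one-triple-per-line NTriples layout in which the subject and predicate contain no spaces, and the input file name is hypothetical::

    import gzip
    from collections import Counter

    def count_strings(ntriples_path):
        """Count source/target node strings and property strings in an NTriples dump."""
        entity_counts = Counter()
        relation_counts = Counter()
        with gzip.open(ntriples_path, "rt", encoding="utf-8") as f:
            for line in f:
                line = line.rstrip("\n")
                if not line or line.startswith("#"):
                    continue
                # Subject and predicate contain no spaces; the object is the rest
                # of the line, up to the final " ." statement terminator.
                subject, predicate, obj = line.split(" ", 2)
                if obj.endswith(" ."):
                    obj = obj[:-2]
                entity_counts[subject] += 1
                entity_counts[obj] += 1
                relation_counts[predicate] += 1
        return entity_counts, relation_counts

    entity_counts, relation_counts = count_strings("truthy.nt.gz")  # hypothetical file name
    # Keep only the strings that appear at least 5 times, as described above.
    entities = {e for e, c in entity_counts.items() if c >= 5}
    relation_types = {r for r, c in relation_counts.items() if c >= 5}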