GitHub - gsp2014/WikiPeople: An n-ary relational dataset derived from Wikidata

WikiPeople and its derivatives: N-ary relational datasets about people derived from Wikidata

WikiPeople was constructed as follows:

The Wikidata dump was downloaded, and the facts concerning entities of type human were extracted.
These facts were then denoised. For example, facts containing elements related to images were filtered out, and facts containing an element in {unknown value, no values} were removed.
Subsequently, the elements with at least 30 mentions were selected, and the facts related to these elements were kept. Each fact was then parsed into a set of its role-value pairs.
The remaining facts were randomly split into a training set, a validation set, and a test set in the ratio 80%:10%:10%.
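
The last two steps can be illustrated with a minimal sketch. It is not the authors' construction script: the function names, the fixed seed, and the representation of a fact as a tuple of element ids are assumptions made for illustration, and the rule that a fact is kept only when all of its elements reach the threshold is one possible reading of the description above.

```python
import random
from collections import Counter

MIN_MENTIONS = 30  # threshold stated in the construction steps


def filter_by_mentions(facts, min_mentions=MIN_MENTIONS):
    """Keep facts whose elements all occur at least `min_mentions` times.

    Assumption: a fact is kept only if every one of its elements is
    frequent enough; the README does not spell out this detail.
    """
    counts = Counter(element for fact in facts for element in fact)
    return [f for f in facts if all(counts[e] >= min_mentions for e in f)]


def split_facts(facts, seed=0):
    """Shuffle the facts and split them 80%/10%/10% into train/valid/test."""
    shuffled = list(facts)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 8 // 10
    n_valid = len(shuffled) // 10
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test
```

The counts are computed once over all facts; whether the original pipeline recounted mentions after filtering is not stated.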

The statistics of WikiPeople are displayed as follows, where #Train, #Valid and #Test are the sizes of the training set, the validation set and the test set, respectively.

	Binary	N-ary	Overall
#Train	270,179	35,546	305,725
#Valid	33,845	4,378	38,223
#Test	33,890	4,391	38,281

The training, validation, and test facts are stored in the files n-ary_train.json, n-ary_valid.json, and n-ary_test.json, respectively, all in the same format. Each line is a set of ("role id": "value id/list of value ids") pairs plus the arity information in the form ("N": arity), and corresponds to a fact in Wikidata about a certain person. All ids are adopted directly from Wikidata, except those ending with "_h" or "_t", which are defined by WikiPeople.
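
Since each line is a standalone JSON object, the files can be read line by line. The following is a minimal sketch, not part of the repository: the helper names `load_facts` and `split_by_arity` are hypothetical, and treating facts with "N" equal to 2 as binary is an assumption consistent with the Binary/N-ary columns in the statistics table.

```python
import json


def load_facts(path):
    """Read one JSON object per line, skipping blank lines."""
    facts = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                facts.append(json.loads(line))
    return facts


def split_by_arity(facts):
    """Separate binary facts (N == 2) from n-ary facts (N > 2)."""
    binary = [fact for fact in facts if fact["N"] == 2]
    nary = [fact for fact in facts if fact["N"] > 2]
    return binary, nary
```

For example, `binary, nary = split_by_arity(load_facts("n-ary_train.json"))`, where the path depends on where the dataset files are stored locally.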

A fact example of WikiPeople

Take Line 37714 in n-ary_valid.json for example:

{
  "P166_h": "Q7186", 
  "P166_t": "Q38104", 
  "N": 5, 
  "P585": ["+1903-01-01T00:00:00Z#0#0#0#9#http://www.wikidata.org/entity/Q1985727"], 
  "P1706": ["Q41269", "Q37463"]
}

This example corresponds to the following fact in Wikidata: Marie Curie received the Nobel Prize in Physics in 1903, together with Henri Becquerel and Pierre Curie.

The detailed description of Line 37714 is as follows (items newly defined or introduced by WikiPeople are in italics and underlined):

Item	Description
P166	The id of the relation "award received"
P166_h	The id of the subject role of "award received"
Q7186	The id of the value "Marie Curie"
P166_t	The id of the object role of "award received"
Q38104	The id of the value "Nobel Prize in Physics"
"N":5	The arity of this fact is 5
P585	The id of the role "point in time"
+1903-01-01T00:00:00Z#0#0#0#9#http://www.wikidata.org/entity/Q1985727	The id of the value "1903"
P1706	The id of the role "together with"
Q41269	The id of the value "Henri Becquerel"
Q37463	The id of the value "Pierre Curie"
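
As a small illustration of how the example decomposes, the sketch below flattens the fact into individual role-value pairs. The helper `role_value_pairs` is hypothetical, and the observation that the arity equals the number of flattened pairs is drawn from this one example rather than from a stated definition.

```python
fact = {
    "P166_h": "Q7186",
    "P166_t": "Q38104",
    "N": 5,
    "P585": ["+1903-01-01T00:00:00Z#0#0#0#9#http://www.wikidata.org/entity/Q1985727"],
    "P1706": ["Q41269", "Q37463"],
}


def role_value_pairs(fact):
    """Flatten a fact into (role id, value id) pairs, ignoring the "N" key."""
    pairs = []
    for role, value in fact.items():
        if role == "N":
            continue
        # A value is either a single id or a list of ids.
        values = value if isinstance(value, list) else [value]
        pairs.extend((role, v) for v in values)
    return pairs


pairs = role_value_pairs(fact)
# Two pairs for P166_h/P166_t, one for P585, two for P1706.
assert len(pairs) == fact["N"]
```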

Derive WikiPeople-n from WikiPeople

Because the percentage of n-ary relational facts in WikiPeople is low (less than 12%), a new dataset, WikiPeople-n, was derived from it. Specifically, all the n-ary relational facts in WikiPeople were kept, and some binary relational facts were randomly removed so that the proportions of binary and n-ary facts match those of the training set of JF17K.

The statistics of WikiPeople-n are displayed as follows:

	Binary	N-ary	Overall
#Train	48,851	35,546	84,397
#Valid	6,190	4,248	10,438
#Test	6,186	4,264	10,450

Note that binary relational facts were randomly removed from the training set first, and then from the validation set and the test set. After some binary relational facts are removed from the training set, some elements (roles/values) may exist only in the validation/test set. The validation and test facts containing these elements were removed first, before the random removal was conducted.
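
The removal of validation/test facts with elements unseen in training can be sketched as follows. This is an illustrative reading of the note above, not the authors' script: the helper names `elements_of` and `drop_unseen` are hypothetical, and facts are assumed to be dictionaries in the file format described earlier.

```python
def elements_of(fact):
    """Return the set of role ids and value ids in a fact (the "N" key is skipped)."""
    elements = set()
    for role, value in fact.items():
        if role == "N":
            continue
        elements.add(role)
        elements.update(value if isinstance(value, list) else [value])
    return elements


def drop_unseen(train, held_out):
    """Keep only held-out facts whose elements all appear in the training set."""
    seen = set()
    for fact in train:
        seen |= elements_of(fact)
    return [fact for fact in held_out if elements_of(fact) <= seen]
```

Under this reading, `drop_unseen` would be applied to the validation set and the test set after the training set has been reduced, and the random removal of binary facts would follow.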

Further vary the percentage of binary relational facts in WikiPeople

We vary the percentage (0%, 50%, and 100%) of binary relational facts in WikiPeople to get three datasets, WikiPeople-0bi, WikiPeople-50bi, and WikiPeople-100bi, respectively.

The statistics of these three datasets are presented as follows:

Dataset	WikiPeople-0bi	WikiPeople-50bi			WikiPeople-100bi
Category	N-ary/Overall	Binary	N-ary	Overall	Binary/Overall
#Train	35,546	35,590	35,546	71,136	270,179
#Valid	3,912	4,234	4,218	8,452	33,649
#Test	3,930	4,253	4,228	8,481	33,694

When using the datasets, please cite:

@inproceedings{NaLP,
  title={Link prediction on n-ary relational data},
  author={Guan, Saiping and Jin, Xiaolong and Wang, Yuanzhuo and Cheng, Xueqi},
  booktitle={Proceedings of the 28th International Conference on World Wide Web (WWW'19)},
  year={2019},
  pages={583--593}
}

For any questions, please contact guansaiping@ict.ac.cn or jinxiaolong@ict.ac.cn, or open an issue.
