------------------------------
ReVerb ClueWeb Extractions 1.1
------------------------------

Authors:
  Anthony Fader (afader@cs.washington.edu)
  Michael Schmitz
  Robert Bart
  Stephen Soderland
  Oren Etzioni

This file describes the ReVerb ClueWeb Extractions data set. The data set
contains approximately 15 million binary assertions from the Web. The data is
available for download from the ReVerb homepage at:

  http://reverb.cs.washington.edu/

This data is for academic use only. If you use it in your research, please use
the following citation:

  @inproceedings{Fader11,
    author    = {Anthony Fader and Stephen Soderland and Oren Etzioni},
    title     = {Identifying Relations for Open Information Extraction},
    booktitle = {Proceedings of the Conference of Empirical Methods
                 in Natural Language Processing ({EMNLP} '11)},
    year      = {2011},
    month     = {July 27-31},
    address   = {Edinburgh, Scotland, UK}
  }

----------
Statistics
----------

Each record in the data set corresponds to a (arg1, relation, arg2) tuple. The
numbers below are the number of distinct tuples, argument strings, and relation
strings in the data set:

  Tuples:           14,728,268
  Argument Strings:  2,263,915
  Relation Strings:    664,746

-------------
Preprocessing
-------------

This data set is a subset of ReVerb's output run on the English portion of the
ClueWeb09 corpus. For more information on ClueWeb09, please visit:

  http://lemurproject.org/clueweb09.php/

We ran ReVerb version 1.0 on the Corpus, which resulted in approximately
6 billion extractions. This initial set of extractions is too large to
distribute over the Web, and is noisy. To obtain a smaller, higher-precision
data set, we filtered the extractions along these dimensions:

- Confidence Threshold: Each (extraction, sentence) pair is assigned a
  confidence score in [0,1] by a classifier (see Fader et al. 2011 for more
  details). We removed all extractions with a confidence value less than 0.9.
- Syntactic Filter: We filtered extractions based on some syntactic features
  of the arguments and relations. We filtered out any extraction with
  arguments that are definite common nouns (the NN or the NNS), or contain
  pronouns, demonstrative determiners, and certain quantifiers (both, all,
  certain, other, etc.). We also filtered out any extraction containing a
  proper noun or number in its relation.
- Stopword Filter: We removed extractions that consisted of common temporal
  words (e.g. yesterday, tonight, days of the week) and extractions with
  relations that are almost always uninformative (e.g. have, is, said).
- String Frequency Threshold: We counted the number of distinct extractions
  each string argument and relation appeared in, and then filtered out any
  extraction (x,r,y) with (freq(x) < 5) or (freq(r) < 5) or (freq(y) < 5).

After applying these filters, we applied a simple morphological normalization
to arguments and relations by removing tense, pluralization, capitalization,
etc. We then merged extractions having the same normalized form.

------
Format
------

The extractions are stored in a flat text file encoded in UTF-8. The file has
the following tab-separated columns:

   1. Extraction id
   2. Argument 1
   3. Relation
   4. Argument 2
   5. Argument 1 - normalized
   6. Relation - normalized
   7. Argument 2 - normalized
   8. The number of *distinct* sentences this extraction was extracted from
   9. The maximum confidence score assigned to this extraction, over all
      sentences it was extracted from
  10. A list of source URLs (separated by "|") for each sentence

We are unable to provide the source sentences themselves. However, they can be
obtained from the included URLs.

-------
Contact
-------

If you have any questions, comments, or would like access to a different
subset of the ClueWeb09 extractions, please contact Anthony Fader
(afader@cs.washington.edu).
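
As an illustration of the ten-column layout described under Format, the
following is a minimal Python sketch of a reader for the extraction file. It is
not part of the ReVerb distribution. The names Extraction, parse_line, and
read_extractions, the field names, and the sample record are all made up for
this example, and the sketch assumes exactly ten tab-separated fields per line
with no header row.

```python
from typing import Iterator, List, NamedTuple


class Extraction(NamedTuple):
    """One record; field names are illustrative, not defined by the data set."""
    extraction_id: str
    arg1: str
    relation: str
    arg2: str
    arg1_norm: str
    relation_norm: str
    arg2_norm: str
    num_sentences: int      # column 8: number of distinct source sentences
    max_confidence: float   # column 9: maximum confidence over all sentences
    source_urls: List[str]  # column 10: URLs separated by "|"


def parse_line(line: str) -> Extraction:
    """Split one tab-separated line into its ten columns."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 10:
        raise ValueError(f"expected 10 columns, found {len(fields)}")
    urls = fields[9].split("|") if fields[9] else []
    return Extraction(*fields[:7], int(fields[7]), float(fields[8]), urls)


def read_extractions(path: str) -> Iterator[Extraction]:
    """Yield records one at a time so the full file never sits in memory."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield parse_line(line)


if __name__ == "__main__":
    # Invented sample record, not taken from the data set.
    sample = ("1\tParis\tis the capital of\tFrance\t"
              "paris\tbe the capital of\tfrance\t"
              "3\t0.95\thttp://example.com/a|http://example.com/b")
    record = parse_line(sample)
    print(record.arg1_norm, "|", record.relation_norm, "|", record.arg2_norm)
```

Reading line by line keeps memory use flat, which matters for a file with
roughly 15 million records. Pass read_extractions the path of the downloaded
file once it has been decompressed.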