------------------------------
ReVerb ClueWeb Extractions 1.1
------------------------------

Authors:
    Anthony Fader (afader@cs.washington.edu)
    Michael Schmitz
    Robert Bart
    Stephen Soderland
    Oren Etzioni


This file describes the ReVerb ClueWeb Extractions data set. The data set 
contains approximately 15 million binary assertions from the Web. The data is 
available for download from the ReVerb homepage at: 

    http://reverb.cs.washington.edu/

This data is for academic use only. If you use it in your research, please 
use the following citation:

@inproceedings{Fader11,
  author =   {Anthony Fader and Stephen Soderland and Oren Etzioni},
  title =    {Identifying Relations for Open Information Extraction},
  booktitle =    {Proceedings of the Conference on Empirical Methods
                  in Natural Language Processing ({EMNLP} '11)},
  year =     {2011},
  month =    {July 27-31},
  address =  {Edinburgh, Scotland, UK}
}


----------
Statistics
----------

Each record in the data set corresponds to an (arg1, relation, arg2) tuple.
The counts below give the number of distinct tuples, argument strings, and
relation strings in the data set:

    Tuples:           14,728,268
    Argument Strings:  2,263,915
    Relation Strings:    664,746

-------------
Preprocessing
-------------

This data set is a subset of ReVerb's output run on the English portion of 
the ClueWeb09 corpus. For more information on ClueWeb09, please visit:

    http://lemurproject.org/clueweb09.php/

We ran ReVerb version 1.0 on the corpus, which resulted in approximately 6 
billion extractions. This initial set of extractions is both noisy and too
large to distribute over the Web.

To obtain a smaller, higher-precision data set, we filtered the extractions 
along these dimensions: 

- Confidence Threshold: Each (extraction, sentence) pair is assigned a 
  confidence score in [0,1] by a classifier (see Fader et al. 2011 for more
  details). We removed all extractions with a confidence value less than 0.9.

- Syntactic Filter: We filtered extractions based on syntactic features of
  the arguments and relations. We removed any extraction with an argument 
  that is a definite common noun (the NN or the NNS) or that contains a 
  pronoun, a demonstrative determiner, or certain quantifiers (both, all, 
  certain, other, etc.). We also removed any extraction containing a 
  proper noun or number in its relation.

- Stopword Filter: We removed extractions that consisted of common temporal 
  words (e.g. yesterday, tonight, days of the week) and extractions with
  relations that are almost always uninformative (e.g. have, is, said).

- String Frequency Threshold: We counted the number of distinct extractions 
  in which each argument string and relation string appeared, and then 
  filtered out any extraction (x,r,y) with (freq(x) < 5) or (freq(r) < 5) 
  or (freq(y) < 5).
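
As an illustration only, the confidence and string frequency thresholds could
be sketched as follows. This is not the code used to build the data set. The
function name, the tuple layout, the order in which the two filters are
applied, and the choice to pool argument counts across the arg1 and arg2
positions are all assumptions made for the example.

```python
from collections import Counter


def filter_extractions(extractions, min_conf=0.9, min_freq=5):
    """Apply a confidence threshold, then a string frequency threshold.

    `extractions` is an iterable of (arg1, rel, arg2, confidence) tuples.
    This layout is hypothetical and chosen for the example.
    """
    # Confidence threshold: drop anything scored below min_conf.
    confident = [e for e in extractions if e[3] >= min_conf]

    # Count the number of distinct (arg1, rel, arg2) extractions in which
    # each string appears. Assumption: argument counts are pooled across
    # the arg1 and arg2 positions.
    distinct = {(a1, r, a2) for a1, r, a2, _ in confident}
    arg_freq = Counter()
    rel_freq = Counter()
    for a1, r, a2 in distinct:
        # A set, so a string filling both slots of one extraction counts once.
        arg_freq.update({a1, a2})
        rel_freq[r] += 1

    # Keep (x, r, y) only if freq(x), freq(r), and freq(y) all reach min_freq.
    return [
        e for e in confident
        if arg_freq[e[0]] >= min_freq
        and rel_freq[e[1]] >= min_freq
        and arg_freq[e[2]] >= min_freq
    ]
```

The sketch makes a single pass. Removing extractions can lower the frequency
of the strings that remain, and this README does not say whether the filter
was repeated until the counts stabilized.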

After applying these filters, we applied a simple morphological normalization
to arguments and relations by removing tense, pluralization, capitalization, 
etc. We then merged extractions having the same normalized form.
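
The merge step can be pictured with the sketch below, which groups records by
their normalized (arg1, relation, arg2) key and keeps the number of distinct
source sentences and the maximum confidence for each group, matching the
per-extraction fields described under Format. The function name, the record
layout, and the sentence identifiers are hypothetical. The default normalizer
(str.lower) is a stand-in that handles capitalization only; it does not
reproduce the tense and pluralization handling mentioned above.

```python
def merge_by_normalized(records, normalize=str.lower):
    """Merge records that share a normalized (arg1, rel, arg2) form.

    `records` is an iterable of (arg1, rel, arg2, confidence, sentence_id)
    tuples. Returns {normalized_key: (num_distinct_sentences, max_confidence)}.
    """
    merged = {}
    for a1, r, a2, conf, sentence_id in records:
        key = (normalize(a1), normalize(r), normalize(a2))
        entry = merged.setdefault(key, {"sentences": set(), "max_conf": 0.0})
        entry["sentences"].add(sentence_id)          # distinct sentences only
        entry["max_conf"] = max(entry["max_conf"], conf)
    return {
        key: (len(entry["sentences"]), entry["max_conf"])
        for key, entry in merged.items()
    }
```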


------
Format
------

The extractions are stored in a flat text file encoded in UTF-8. The file has
the following tab-separated columns:

     1. Extraction id
     2. Argument 1
     3. Relation
     4. Argument 2
     5. Argument 1 - normalized
     6. Relation - normalized
     7. Argument 2 - normalized
     8. The number of *distinct* sentences this extraction was extracted from
     9. The maximum confidence score assigned to this extraction, over all 
        sentences it was extracted from
    10. A list of source URLs (separated by "|") for each sentence

We are unable to provide the source sentences themselves. However, they can
be obtained from the included URLs.
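
A minimal sketch of reading the ten columns listed above in Python. The class
and function names are hypothetical. The sketch assumes the file has no header
row, that column 8 is an integer, that column 9 is a decimal number, and that
the file has already been decompressed if it was distributed in compressed
form.

```python
from typing import Iterator, List, NamedTuple


class Extraction(NamedTuple):
    extraction_id: str
    arg1: str
    rel: str
    arg2: str
    arg1_norm: str
    rel_norm: str
    arg2_norm: str
    num_sentences: int   # number of distinct source sentences
    confidence: float    # maximum confidence over all source sentences
    urls: List[str]      # source URLs, split on "|"


def parse_line(line: str) -> Extraction:
    """Parse one tab-separated line into an Extraction."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 10:
        raise ValueError(f"expected 10 columns, got {len(fields)}")
    return Extraction(
        *fields[:7],
        int(fields[7]),
        float(fields[8]),
        fields[9].split("|"),
    )


def read_extractions(path: str) -> Iterator[Extraction]:
    """Yield one Extraction per non-empty line of a UTF-8 encoded file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield parse_line(line)
```

Reading the file lazily, one line at a time, keeps memory use flat, which
matters for a file with roughly 15 million records.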

-------
Contact
-------

If you have any questions or comments, or would like access to a different
subset of the ClueWeb09 extractions, please contact Anthony Fader
(afader@cs.washington.edu).