Corpus Filtering Details

As of the second year (2014) of the track, a pre-filtered version of the corpus was provided to reduce the filtering load on participants. For the 2015 track, a similar filtered corpus was provided, along with a pre-filtered version of the corpus for the original 2013 track. This page summarizes how these filtered corpora were created.


The TREC-TS-2014F dataset is a filtered version of the KBA 2014 corpus. It is stored in the same format, follows the same file structure (ordered into per-hour folders) and is encrypted with the same GPG key (see above). To create this corpus, two levels of filtering were performed. First, any documents published outside the time periods of the 15 events from the TREC-TS 2014 track topics were removed, i.e. only documents with timestamps between the start and end tags of one or more TREC-TS 2014 topics were kept. Second, the remaining documents were filtered to keep only those likely to contain one or more sentences relevant to an event. This filtering was performed as follows:
  1. For each hour within the time period of each event, all documents from the KBA 2014 corpus that were published within that hour were indexed using the open source Terrier IR platform v4.0. The title of each document (if available) and any text within the body sentences were indexed. Terrier's stopword list and Porter stemming were applied.
  2. The TREC organisers manually identified a set of queries representing the topics of interest relating to each of the 15 events, creating event-query pairs. (These will not be released until after the final submission of runs.)
  3. For each event-query pair, Terrier was used to retrieve the top 1000 documents for the query from each hour index in turn (for the hours belonging to the associated event). The retrieval model used was BM25 with default parameters. The aim was to create a high-recall set of documents from which participants could summarise each event.
  4. Documents that were not retrieved for any query were then filtered out, forming the final TREC-TS-2014F dataset.
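
The steps above can be illustrated with a minimal Python sketch. It is not the organisers' pipeline: the actual filtering used Terrier v4.0, whereas this sketch substitutes a small textbook BM25 scorer (with a non-negative IDF variant) so that it runs on its own. The function names (`hour_key`, `bm25_scores`, `filter_corpus`), the event dictionary layout, and the `k1` and `b` values are all assumptions made for illustration, and documents are assumed to be already tokenised, stopped and stemmed.

```python
import math
from collections import Counter, defaultdict
from datetime import datetime, timezone


def hour_key(ts):
    """Bucket a Unix timestamp into its UTC hour, e.g. '2012-11-01-13'."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d-%H")


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score docs (dict: id -> token list) against a query with textbook BM25.

    Stand-in for Terrier's BM25; k1 and b are assumed values.
    Only documents with a non-zero score are returned.
    """
    if not docs:
        return {}
    n_docs = len(docs)
    avg_len = sum(len(toks) for toks in docs.values()) / n_docs
    # Document frequency: number of docs in this hour containing each term.
    df = Counter(term for toks in docs.values() for term in set(toks))
    scores = {}
    for doc_id, toks in docs.items():
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        if score > 0:
            scores[doc_id] = score
    return scores


def filter_corpus(docs, events, cutoff=1000):
    """Return the ids of documents kept by the two-level filter.

    docs:   list of (doc_id, unix_timestamp, tokens)
    events: list of {"start": ts, "end": ts, "queries": [[term, ...], ...]}
    """
    kept = set()
    for event in events:
        # Level 1: keep only documents inside the event's time period,
        # grouped into one small "index" per hour.
        by_hour = defaultdict(dict)
        for doc_id, ts, tokens in docs:
            if event["start"] <= ts <= event["end"]:
                by_hour[hour_key(ts)][doc_id] = tokens
        # Level 2: per hour and per query, keep the top `cutoff` documents.
        for hour_docs in by_hour.values():
            for query in event["queries"]:
                scores = bm25_scores(query, hour_docs)
                ranked = sorted(scores, key=scores.get, reverse=True)
                kept.update(ranked[:cutoff])
    return kept


if __name__ == "__main__":
    docs = [
        ("a", 1000, ["flood", "river"]),
        ("b", 1500, ["football", "match"]),
        ("c", 999999, ["flood", "warning"]),  # outside the event window
        ("d", 2000, ["flood", "flood", "rescue"]),
    ]
    events = [{"start": 0, "end": 3599, "queries": [["flood"]]}]
    print(sorted(filter_corpus(docs, events, cutoff=1000)))  # ['a', 'd']
    print(sorted(filter_corpus(docs, events, cutoff=1)))     # ['d']
```

A document survives if it appears in the top `cutoff` results for at least one query in at least one hour, so the final set is a union over all event-query pairs and hours. Lowering `cutoff` shrinks that union, which is the only knob this sketch exposes.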


The TREC-TS-2015F dataset is a filtered version of the KBA 2014 corpus for the TREC-TS 2015 topics. The filtering methodology is identical to that used for the TREC-TS-2014F dataset, except that the rank cutoff was 100 rather than 1000. This smaller rank cutoff was chosen because it was observed that most of the relevant content was available in the top-ranked documents. As a result, the 2015 dataset is smaller than the 2014 dataset.


The TREC-TS-2013F dataset was released in 2015 for participants who wanted to train their systems using the 2013 topics. Importantly, the filtering methodology used to create this dataset is not the same as for the other filtered versions. In particular, TREC-TS-2013F is a filtered version of the KBA 2013 corpus for the TREC-TS 2013 topics that was originally created by a participant in the 2014 TREC track. To create this corpus, two levels of filtering were performed. First, any documents published outside the time periods of the 9 events from the TREC-TS 2013 track topics were removed, and only documents from the 'news' subset were considered. Second, the remaining documents were passed through a machine-learned document classifier trained on hand-annotated documents collected from the Reuters news agency for other events. This classifier uses basic distance metrics between the document and the initial event representation (query).