PyTerrier Data Repository

Vaswani

The Vaswani NPL corpus is a small test collection of 11,000 abstracts, created in 1990, that has been used by the Glasgow IR group for many years. Because of its small size, it underpins many of the test cases in both Terrier and PyTerrier.

Retrieval notebooks: View, Download

Variants

We have 4 index variants for this dataset:

terrier_stemmed_text

Last Update: 2021-07-04 (4.5MB)

Terrier index with the default Porter stemming and stopword removal. The document text is also saved in the MetaIndex to facilitate BERT-based reranking.

Browse index

Use this for retrieval in PyTerrier:

bm25_terrier_stemmed_text = pt.BatchRetrieve.from_dataset('vaswani', 'terrier_stemmed_text', wmodel='BM25')

dph_terrier_stemmed_text = pt.BatchRetrieve.from_dataset('vaswani', 'terrier_stemmed_text', wmodel='DPH')

dph_bo1_terrier_stemmed_text = dph_terrier_stemmed_text >> pt.rewrite.Bo1QueryExpansion(pt.get_dataset('vaswani').get_index('terrier_stemmed_text')) >> dph_terrier_stemmed_text
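Because this variant keeps the document text in the MetaIndex, a retriever can return that text with each result so that a reranker can score it. The following is a minimal sketch of such a pipeline, not the repository's recommended setup. It assumes the stored metadata key is named `text`; `overlap_score` is a hypothetical toy scorer standing in for a BERT-based model, and `build_rerank_pipeline` is a hypothetical helper name.

```python
def overlap_score(query: str, text: str) -> float:
    """Toy stand-in for a neural reranker: fraction of query terms found in the text."""
    q_terms = set(query.lower().split())
    if not q_terms:
        return 0.0
    d_terms = set(text.lower().split())
    return len(q_terms & d_terms) / len(q_terms)


def build_rerank_pipeline(cutoff: int = 100):
    """BM25 retrieval, cut to the top `cutoff` results, then rescored on stored text."""
    import pyterrier as pt  # imported lazily; requires PyTerrier and Java

    # Older PyTerrier releases need an explicit init before any other call.
    if not pt.started():
        pt.init()

    # Assumption: the MetaIndex exposes the keys 'docno' and 'text'.
    bm25 = pt.BatchRetrieve.from_dataset(
        'vaswani', 'terrier_stemmed_text', wmodel='BM25',
        metadata=['docno', 'text'])

    # pt.apply.doc_score calls the function once per result row.
    rerank = pt.apply.doc_score(
        lambda row: overlap_score(row['query'], row['text']))

    # `% cutoff` keeps only the top-ranked results before reranking.
    return (bm25 % cutoff) >> rerank
```

Calling `build_rerank_pipeline().search('chemical reactions')` would then return a results frame whose scores come from the toy scorer; swapping in a real BERT reranker would replace only the `rerank` stage.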

terrier_unstemmed

Last Update: 2021-08-06 (3.4MB)

Terrier index, no stemming, no stopword removal

Browse index

Use this for retrieval in PyTerrier:

bm25_terrier_unstemmed = pt.BatchRetrieve.from_dataset('vaswani', 'terrier_unstemmed', wmodel='BM25')

dph_terrier_unstemmed = pt.BatchRetrieve.from_dataset('vaswani', 'terrier_unstemmed', wmodel='DPH')

dph_bo1_terrier_unstemmed = dph_terrier_unstemmed >> pt.rewrite.Bo1QueryExpansion(pt.get_dataset('vaswani').get_index('terrier_unstemmed')) >> dph_terrier_unstemmed

terrier_stemmed

Last Update: 2021-07-04 (2.7MB)

Terrier index with the default Porter stemming and stopword removal

Browse index

Use this for retrieval in PyTerrier:

bm25_terrier_stemmed = pt.BatchRetrieve.from_dataset('vaswani', 'terrier_stemmed', wmodel='BM25')

dph_terrier_stemmed = pt.BatchRetrieve.from_dataset('vaswani', 'terrier_stemmed', wmodel='DPH')

dph_bo1_terrier_stemmed = dph_terrier_stemmed >> pt.rewrite.Bo1QueryExpansion(pt.get_dataset('vaswani').get_index('terrier_stemmed')) >> dph_terrier_stemmed
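The three pipelines above can be compared side by side using the topics and relevance judgements that ship with the dataset. The sketch below is one way to do this, assuming `pt.Experiment` with standard metric names; the function name `compare_stemmed_pipelines` and the choice of metrics are illustrative, not part of the repository.

```python
PIPELINE_NAMES = ['BM25', 'DPH', 'DPH+Bo1']
EVAL_METRICS = ['map', 'ndcg_cut_10']  # illustrative choice of measures


def compare_stemmed_pipelines():
    """Evaluate BM25, DPH and DPH with Bo1 query expansion on the stemmed index."""
    import pyterrier as pt  # imported lazily; requires PyTerrier and Java

    # Older PyTerrier releases need an explicit init before any other call.
    if not pt.started():
        pt.init()

    dataset = pt.get_dataset('vaswani')
    bm25 = pt.BatchRetrieve.from_dataset('vaswani', 'terrier_stemmed', wmodel='BM25')
    dph = pt.BatchRetrieve.from_dataset('vaswani', 'terrier_stemmed', wmodel='DPH')

    # Retrieve, expand the query from the top-ranked documents, then retrieve again.
    bo1 = pt.rewrite.Bo1QueryExpansion(dataset.get_index('terrier_stemmed'))
    dph_bo1 = dph >> bo1 >> dph

    # Returns a DataFrame with one row per pipeline and one column per metric.
    return pt.Experiment(
        [bm25, dph, dph_bo1],
        dataset.get_topics(),
        dataset.get_qrels(),
        eval_metrics=EVAL_METRICS,
        names=PIPELINE_NAMES)
```

The same function could be pointed at `terrier_unstemmed` by changing the variant name in all three places, which gives a quick check of how much stemming and stopword removal matter on this collection.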

colbert_uog44k

Last Update: 2021-09-17 (167.1MB)

ColBERT dense retrieval index using model trained by UoG for TREC 2020 DL track

Browse index

Use this for retrieval in PyTerrier:

from pyterrier_colbert.ranking import ColBERTFactory

colbert_e2e = ColBERTFactory.from_dataset('vaswani', 'colbert_uog44k').end_to_end()

colbert_prf_rank = ColBERTFactory.from_dataset('vaswani', 'colbert_uog44k').colbert_prf(rerank=False)

colbert_prf_rerank = ColBERTFactory.from_dataset('vaswani', 'colbert_uog44k').colbert_prf(rerank=True)
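Each of these objects is a PyTerrier transformer, so it can be queried directly. The sketch below shows one possible way to run a single query through the end-to-end pipeline; it assumes `pyterrier_colbert` and its dependencies are installed, and `colbert_top_k` and `EXAMPLE_QUERY` are hypothetical names chosen for illustration.

```python
EXAMPLE_QUERY = 'measurement of dielectric constant'  # hypothetical query text


def colbert_top_k(query: str = EXAMPLE_QUERY, k: int = 5):
    """Run one query through ColBERT end-to-end retrieval and keep the top k rows."""
    # Imported lazily; requires pyterrier_colbert and its dependencies.
    from pyterrier_colbert.ranking import ColBERTFactory

    # Build the factory once and reuse it, rather than reloading the index
    # for every pipeline derived from it.
    factory = ColBERTFactory.from_dataset('vaswani', 'colbert_uog44k')
    colbert_e2e = factory.end_to_end()

    # search() wraps the query in a one-row frame and returns the results frame.
    results = colbert_e2e.search(query)
    return results.sort_values('score', ascending=False).head(k)
```

Holding on to a single factory object, as done here, is a design choice for the sketch: the PRF variants could be built from the same `factory` instead of calling `from_dataset` again.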