PyTerrier Data Repository

MSMARCOv2 Document Ranking

A document ranking corpus containing 11.9 million documents, also used by the TREC 2021 Deep Learning track.


Variants

We have 1 index variant for this dataset:

terrier_stemmed

Last Update: 2021-07-04 (37.7GB)

Terrier's default Porter stemming, with stopwords removed.


Use this for retrieval in PyTerrier:

bm25_terrier_stemmed = pt.BatchRetrieve.from_dataset('msmarcov2_document', 'terrier_stemmed', wmodel='BM25')
dph_terrier_stemmed = pt.BatchRetrieve.from_dataset('msmarcov2_document', 'terrier_stemmed', wmodel='DPH')
dph_bo1_terrier_stemmed = dph_terrier_stemmed >> pt.rewrite.Bo1QueryExpansion(pt.get_dataset('msmarcov2_document').get_index('terrier_stemmed')) >> dph_terrier_stemmed
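As a minimal sketch of how one of these retrievers might be used for a single ad hoc query, the example below wraps the BM25 variant in a small function. The helper names clean_query and search_top_k are hypothetical and not part of PyTerrier. The sketch assumes PyTerrier is installed, that older releases need pt.init() before first use, and that the 37.7GB index can be downloaded on first access. Stripping punctuation from the query is a precaution against characters that Terrier's query parser may treat as operators, not a documented requirement of this index.

```python
def clean_query(text):
    """Replace non-alphanumeric characters with spaces and collapse whitespace.

    Hypothetical helper: a precaution against punctuation that Terrier's
    query parser may interpret as query-language operators.
    """
    spaced = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in text)
    return " ".join(spaced.split())


def search_top_k(query, k=10):
    """Return the top-k BM25 results for one query as a pandas DataFrame.

    Hypothetical helper. PyTerrier is imported lazily so that this module
    can be loaded without it; the first call downloads the index.
    """
    import pyterrier as pt

    # Assumption: older PyTerrier releases require an explicit pt.init(),
    # while newer ones start the JVM on demand.
    if hasattr(pt, "started") and not pt.started():
        pt.init()

    bm25 = pt.BatchRetrieve.from_dataset(
        'msmarcov2_document', 'terrier_stemmed', wmodel='BM25')

    # search() runs a single query; results are sorted by descending score.
    results = bm25.search(clean_query(query))
    return results.head(k)
```

Calling search_top_k("what is information retrieval?", k=5) would then return a DataFrame of the five highest-scoring documents, typically with docno, rank and score columns. The same pattern should apply to the DPH and DPH+Bo1 pipelines by swapping the retriever.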