PyTerrier Data Repository

MSMARCOv2 Document Ranking

A new version of the MSMARCO document ranking corpus, containing 11.9 million documents. It is also used by the TREC 2021 Deep Learning track.


Variants

We have 2 index variants for this dataset:

terrier_stemmed_positions

Last Update: 2021-07-05. Size: 77.7GB

Terrier index built with Terrier's default Porter stemming and with stopwords removed. Term positions are stored in the index so that proximity and phrase queries can be evaluated.
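To illustrate why stored positions matter, here is a minimal pure-Python sketch of a positional index and a phrase lookup over it. It is not Terrier's index format or matching code: the function names `build_positional_index` and `phrase_match` and the two toy documents are assumptions made for illustration, and the sketch skips stemming and stopword removal entirely.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each term to {docno: [positions]} for a dict of docno -> text."""
    index = defaultdict(lambda: defaultdict(list))
    for docno, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term][docno].append(pos)
    return index


def phrase_match(index, phrase):
    """Return sorted docnos where the phrase terms occur at consecutive positions."""
    terms = phrase.lower().split()
    if not terms:
        return []
    hits = []
    # Only documents containing the first term can match the phrase.
    for docno, starts in index.get(terms[0], {}).items():
        for start in starts:
            # Term i of the phrase must appear exactly i positions after the start.
            if all(start + i in index.get(t, {}).get(docno, ())
                   for i, t in enumerate(terms)):
                hits.append(docno)
                break
    return sorted(hits)


# Hypothetical toy collection, not taken from MSMARCO.
docs = {
    "d1": "deep learning for document ranking",
    "d2": "learning deep representations",
}
index = build_positional_index(docs)
print(phrase_match(index, "deep learning"))  # ['d1']
```

Both documents contain "deep" and "learning", but only `d1` has them adjacent and in order. An index without positions cannot make that distinction, which is the trade-off behind the larger size of this variant.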


Use this for retrieval in PyTerrier:

import pyterrier as pt  # older PyTerrier releases may also need pt.init() before the lines below

bm25_terrier_stemmed_positions = pt.BatchRetrieve.from_dataset('msmarcov2_document', 'terrier_stemmed_positions', wmodel='BM25')

dph_terrier_stemmed_positions = pt.BatchRetrieve.from_dataset('msmarcov2_document', 'terrier_stemmed_positions', wmodel='DPH')

dph_bo1_terrier_stemmed_positions = dph_terrier_stemmed_positions >> pt.rewrite.Bo1QueryExpansion(pt.get_dataset('msmarcov2_document').get_index('terrier_stemmed_positions')) >> dph_terrier_stemmed_positions
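Once a retriever has been created, a single query can be run through it. The helper below is a small sketch rather than part of PyTerrier: `top_hits` is a hypothetical name, and it assumes the retriever exposes `.search(query)` and returns a rank-ordered pandas DataFrame with `docno` and `score` columns, as PyTerrier transformers generally do.

```python
def top_hits(retriever, query, k=5):
    """Return the top-k (docno, score) pairs for one query string.

    Assumes `retriever.search(query)` returns a rank-ordered table with
    `docno` and `score` columns (a pandas DataFrame in PyTerrier).
    """
    results = retriever.search(query)
    top = results.head(k)
    return list(zip(top["docno"], top["score"]))


# Example call, assuming the retrievers defined above already exist.
# Creating them via from_dataset() fetches the index if it is not cached locally:
#
#   for docno, score in top_hits(bm25_terrier_stemmed_positions, "what is deep learning"):
#       print(docno, score)
```

The same helper works unchanged for the DPH and DPH+Bo1 pipelines, since each of them is also a transformer that accepts a query and returns ranked results.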

terrier_stemmed

Last Update: 2021-07-04. Size: 37.7GB

Terrier index built with Terrier's default Porter stemming and with stopwords removed.


Use this for retrieval in PyTerrier:

import pyterrier as pt  # older PyTerrier releases may also need pt.init() before the lines below

bm25_terrier_stemmed = pt.BatchRetrieve.from_dataset('msmarcov2_document', 'terrier_stemmed', wmodel='BM25')

dph_terrier_stemmed = pt.BatchRetrieve.from_dataset('msmarcov2_document', 'terrier_stemmed', wmodel='DPH')

dph_bo1_terrier_stemmed = dph_terrier_stemmed >> pt.rewrite.Bo1QueryExpansion(pt.get_dataset('msmarcov2_document').get_index('terrier_stemmed')) >> dph_terrier_stemmed
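The DPH+Bo1 pipeline reads left to right: retrieve with DPH, rewrite the query using the top-ranked documents, then retrieve again with the expanded query. The toy sketch below only illustrates that retrieve, expand, retrieve shape and how `>>` can chain stages. Everything in it is hypothetical: the `Stage` class, the three-document collection, the term-overlap scoring, and the expansion rule (append the most frequent new terms from the top document) are simplifications and are not PyTerrier's implementation, DPH, or the Bo1 term-weighting model.

```python
from collections import Counter


class Stage:
    """Hypothetical pipeline stage; `a >> b` feeds the output of a into b."""

    def __init__(self, fn):
        self.fn = fn

    def __rshift__(self, other):
        return Stage(lambda state: other.fn(self.fn(state)))

    def __call__(self, state):
        return self.fn(state)


# Toy collection, not taken from MSMARCO.
DOCS = {
    "d1": "neural ranking models",
    "d2": "neural networks for retrieval",
    "d3": "cooking pasta",
}


def retrieve(state):
    """Rank documents by how many query terms they contain (toy scoring)."""
    query_terms = set(state["query"].split())
    scored = [(len(query_terms & set(text.split())), docno)
              for docno, text in DOCS.items()]
    ranked = [docno
              for score, docno in sorted(scored, key=lambda p: (-p[0], p[1]))
              if score > 0]
    return {"query": state["query"], "ranked": ranked}


def expand(state, n_docs=1, n_terms=2):
    """Append the most frequent new terms from the top-ranked documents."""
    original = state["query"].split()
    counts = Counter(
        term
        for docno in state["ranked"][:n_docs]
        for term in DOCS[docno].split()
        if term not in original
    )
    extra = [term for term, _ in counts.most_common(n_terms)]
    return {"query": " ".join(original + extra), "ranked": state["ranked"]}


first_pass = Stage(retrieve)
pipeline = first_pass >> Stage(expand) >> Stage(retrieve)

before = first_pass({"query": "ranking"})
after = pipeline({"query": "ranking"})
print(before["ranked"])  # ['d1']
print(after["ranked"])   # ['d1', 'd2']
```

In this toy run the expanded query picks up "neural" from the top document, so the second pass also finds `d2`, which the original one-term query missed. That recall gain is the usual motivation for pseudo-relevance feedback.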