PyTerrier Data Repository

MSMARCO v2 Passage Ranking

A revised corpus of 138M passages released by Microsoft in July 2021. The task is to rank the passages by their relevance to questions. The corpus is also used by the TREC 2021 Deep Learning track.

Variants

We have 2 index variants for this dataset:

terrier_unstemmed

Last Update 2021-08-06, 39.9GB

Terrier index, no stemming, no stopword removal.

Browse index

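Use this for retrieval in PyTerrier (the snippets assume import pyterrier as pt; the BERT pipelines additionally assume import onir_pt from OpenNIR):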

bm25_terrier_unstemmed = pt.BatchRetrieve.from_dataset('msmarcov2_passage', 'terrier_unstemmed', wmodel='BM25')

bm25_bert_terrier_unstemmed = pt.BatchRetrieve.from_dataset('msmarcov2_passage', 'terrier_unstemmed', wmodel='BM25') >> pt.text.get_text(pt.get_dataset('irds:msmarco-passage-v2'), 'text') >> onir_pt.reranker('hgf4_joint', ranker_config={'model': 'Capreolus/bert-base-msmarco', 'norm': 'softmax-2'})

dph_terrier_unstemmed = pt.BatchRetrieve.from_dataset('msmarcov2_passage', 'terrier_unstemmed', wmodel='DPH')

dph_bert_terrier_unstemmed = pt.BatchRetrieve.from_dataset('msmarcov2_passage', 'terrier_unstemmed', wmodel='DPH') >> pt.text.get_text(pt.get_dataset('irds:msmarco-passage-v2'), 'text') >> onir_pt.reranker('hgf4_joint', ranker_config={'model': 'Capreolus/bert-base-msmarco', 'norm': 'softmax-2'})
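The >> operator in the BERT pipelines above chains a first-stage retriever, a text lookup, and a re-ranker. The following is a minimal pure-Python sketch of that retrieve-then-rerank composition. The names Stage, retrieve, get_text, rerank, and the three-passage CORPUS are hypothetical illustrations; this is not PyTerrier's implementation, and the toy overlap scores merely stand in for BM25/DPH and BERT.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]  # one row: a query, or a candidate passage for it

# Toy corpus standing in for the real passage collection (hypothetical data).
CORPUS = {
    "p1": "terrier is an information retrieval platform",
    "p2": "a terrier is a small dog breed",
    "p3": "bm25 is a ranking function used in retrieval",
}


class Stage:
    """Hypothetical pipeline stage mapping a list of rows to a new list."""

    def __init__(self, fn: Callable[[List[Row]], List[Row]]):
        self.fn = fn

    def __call__(self, rows: List[Row]) -> List[Row]:
        return self.fn(rows)

    def __rshift__(self, other: "Stage") -> "Stage":
        # a >> b: run a, then feed its output to b
        return Stage(lambda rows: other(self(rows)))


def retrieve(rows: List[Row]) -> List[Row]:
    # First stage: score = number of distinct query terms found in the passage.
    out = []
    for row in rows:
        terms = set(str(row["query"]).split())
        for docno, text in CORPUS.items():
            score = len(terms & set(text.split()))
            if score > 0:
                out.append({"query": row["query"], "docno": docno, "score": float(score)})
    return out


def get_text(rows: List[Row]) -> List[Row]:
    # Second stage: attach the passage text so the re-ranker can read it.
    return [{**row, "text": CORPUS[str(row["docno"])]} for row in rows]


def rerank(rows: List[Row]) -> List[Row]:
    # Third stage: stand-in for a neural re-ranker. Rescores each candidate by
    # the fraction of its tokens that are query terms, then sorts descending.
    rescored = []
    for row in rows:
        terms = set(str(row["query"]).split())
        tokens = str(row["text"]).split()
        hits = sum(1 for tok in tokens if tok in terms)
        rescored.append({**row, "score": hits / len(tokens)})
    return sorted(rescored, key=lambda r: r["score"], reverse=True)


pipeline = Stage(retrieve) >> Stage(get_text) >> Stage(rerank)
results = pipeline([{"query": "terrier retrieval"}])
for r in results:
    print(r["docno"], round(r["score"], 3))
```

The text lookup sits between the two scoring stages because the first stage returns only passage identifiers and scores, while the re-ranker needs the passage text itself.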

terrier_stemmed

Last Update 2021-08-08, 36.2GB

Terrier index with Terrier's default Porter stemming and stopword removal.
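As a rough illustration of what distinguishes the two variants, the sketch below contrasts the terms an unstemmed index would keep with those left after stopword removal and stemming. Everything here is an assumption made for illustration: STOPWORDS is a tiny hypothetical list (Terrier's own list is much longer), and crude_stem is a simplistic suffix stripper, not the Porter algorithm.

```python
import re

# Hypothetical, tiny stopword list for illustration only.
STOPWORDS = {"a", "an", "and", "are", "be", "by", "in", "of", "on", "the", "to", "which"}


def tokenise(text):
    # Lowercase and split on anything that is not a letter or digit.
    return re.findall(r"[a-z0-9]+", text.lower())


def crude_stem(token):
    """Very rough suffix stripper; NOT the Porter algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def unstemmed_terms(text):
    # Mirrors "no stemming, no stopword removal": every token is kept.
    return tokenise(text)


def stemmed_terms(text):
    # Mirrors "stemming and stopword removal" with the toy components above.
    return [crude_stem(t) for t in tokenise(text) if t not in STOPWORDS]


passage = "The passages are ranked by relevance to questions"
print(unstemmed_terms(passage))
print(stemmed_terms(passage))
```

In practice this means a query term such as "question" can match a passage containing "questions" only when both sides have been stemmed, which is one reason the two variants can rank differently.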

Browse index

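Use this for retrieval in PyTerrier (the snippets assume import pyterrier as pt; the BERT pipelines additionally assume import onir_pt from OpenNIR):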

bm25_terrier_stemmed = pt.BatchRetrieve.from_dataset('msmarcov2_passage', 'terrier_stemmed', wmodel='BM25')

bm25_bert_terrier_stemmed = pt.BatchRetrieve.from_dataset('msmarcov2_passage', 'terrier_stemmed', wmodel='BM25') >> pt.text.get_text(pt.get_dataset('irds:msmarco-passage-v2'), 'text') >> onir_pt.reranker('hgf4_joint', ranker_config={'model': 'Capreolus/bert-base-msmarco', 'norm': 'softmax-2'})

dph_terrier_stemmed = pt.BatchRetrieve.from_dataset('msmarcov2_passage', 'terrier_stemmed', wmodel='DPH')

dph_bert_terrier_stemmed = pt.BatchRetrieve.from_dataset('msmarcov2_passage', 'terrier_stemmed', wmodel='DPH') >> pt.text.get_text(pt.get_dataset('irds:msmarco-passage-v2'), 'text') >> onir_pt.reranker('hgf4_joint', ranker_config={'model': 'Capreolus/bert-base-msmarco', 'norm': 'softmax-2'})
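For readers unfamiliar with the wmodel='BM25' setting used above, here is a minimal sketch of BM25 scoring in its common textbook form, assuming k1=1.2 and b=0.75. The function bm25_scores and the three toy documents are hypothetical; Terrier's own BM25 may differ in details such as the idf variant and query-term weighting, so treat this as an illustration of the idea rather than a reproduction of Terrier's scores.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document in docs against query with textbook BM25."""
    tokenised = [d.lower().split() for d in docs]
    n_docs = len(tokenised)
    avg_len = sum(len(toks) for toks in tokenised) / n_docs

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for toks in tokenised:
        df.update(set(toks))

    scores = []
    for toks in tokenised:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if tf[term] == 0:
                continue  # term absent from this document
            # Rarer terms get a larger inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates and is normalised by document length.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = ["terrier retrieval platform", "small dog breed", "terrier dog"]
scores = bm25_scores("terrier retrieval", docs)
print(scores)
```

The first document matches both query terms and so scores highest, the third matches only the more common term, and the second matches nothing and scores zero.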