PyTerrier Data Repository

MSMARCO Passage Ranking

A passage ranking task over a corpus of 8.8 million passages released by Microsoft, in which passages are ranked by their relevance to questions. Also used by the TREC Deep Learning track.

Retrieval notebooks: View, Download

Variants

We have 6 index variants for this dataset:

terrier_stemmed_text

Last update: 2021-08-06. Size: 3.4GB

Terrier index with default Porter stemming and stopword removal. Passage text is also saved in the MetaIndex to facilitate BERT-based reranking.

Use this for retrieval in PyTerrier:

import pyterrier as pt

import onir_pt

# Let's use a vanilla BERT ranker from OpenNIR, loading the Capreolus model available from Hugging Face.

vanilla_bert = onir_pt.reranker('hgf4_joint', ranker_config={'model': 'Capreolus/bert-base-msmarco', 'norm': 'softmax-2'})

# First stage: BM25 over the stemmed index, returning the stored passage text for the reranker.

bm25_terrier_stemmed_text = pt.BatchRetrieve.from_dataset('msmarco_passage', 'terrier_stemmed_text', wmodel='BM25', metadata=['docno', 'text'])

# Second stage: rerank the BM25 candidates with BERT.

bm25_bert_terrier_stemmed_text = bm25_terrier_stemmed_text >> vanilla_bert
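
To make the retrieve-then-rerank idea behind the `>>` composition concrete, here is a toy sketch in plain Python. It is not PyTerrier's implementation: the `Stage` class, the three-passage `CORPUS`, and both scoring functions are hypothetical stand-ins (term overlap in place of BM25, a length preference in place of BERT). It only shows that the second stage rescores the candidates returned by the first, using the stored text.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Ranking = List[Tuple[str, float]]  # (docno, score) pairs, best first


@dataclass
class Stage:
    """A toy pipeline stage: maps (query, ranking so far) to a new ranking."""
    fn: Callable[[str, Ranking], Ranking]

    def __rshift__(self, other: "Stage") -> "Stage":
        # a >> b feeds the output of a into b.
        return Stage(lambda query, ranking: other.fn(query, self.fn(query, ranking)))

    def search(self, query: str) -> Ranking:
        return self.fn(query, [])


# Hypothetical three-passage corpus; the real index holds 8.8 million passages.
CORPUS: Dict[str, str] = {
    "d1": "terrier is a search engine platform",
    "d2": "a terrier is a small dog breed",
    "d3": "passage ranking with neural models",
}


def overlap_retrieve(query: str, _ranking: Ranking) -> Ranking:
    """First stage: cheap term-overlap scoring (a stand-in for BM25)."""
    q_terms = set(query.split())
    scored = [(d, float(len(q_terms & set(text.split())))) for d, text in CORPUS.items()]
    hits = [(d, s) for d, s in scored if s > 0]
    return sorted(hits, key=lambda pair: (-pair[1], pair[0]))


def length_rerank(query: str, ranking: Ranking) -> Ranking:
    """Second stage: rescore only the candidates, reading their stored text.

    A stand-in for the BERT reranker; it simply prefers shorter passages.
    """
    rescored = [(d, 1.0 / len(CORPUS[d].split())) for d, _ in ranking]
    return sorted(rescored, key=lambda pair: (-pair[1], pair[0]))


pipeline = Stage(overlap_retrieve) >> Stage(length_rerank)
results = pipeline.search("terrier dog")
print([docno for docno, _ in results])  # ['d1', 'd2']
```

The reranker never sees `d3`, because the first stage did not retrieve it. This is also why the `_text` index variants store the passage text in the MetaIndex: the second stage needs the text of each candidate.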

terrier_unstemmed

Last update: 2021-08-06. Size: 2.1GB

Terrier index, no stemming, no stopword removal.

Use this for retrieval in PyTerrier:

import pyterrier as pt

bm25_terrier_unstemmed = pt.BatchRetrieve.from_dataset('msmarco_passage', 'terrier_unstemmed', wmodel='BM25')

dph_terrier_unstemmed = pt.BatchRetrieve.from_dataset('msmarco_passage', 'terrier_unstemmed', wmodel='DPH')
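
The difference between the unstemmed and stemmed variants comes down to which terms end up in the index. The following is a rough illustration only: `STOPWORDS` is a tiny hypothetical list (Terrier ships a much longer one), and `crude_stem` is a simple suffix stripper, not the Porter algorithm.

```python
from typing import List

# Tiny illustrative stopword list, assumed for this example only.
STOPWORDS = {"a", "an", "the", "is", "are", "of", "to", "in", "and"}


def crude_stem(term: str) -> str:
    """Strip a few common suffixes. NOT the Porter algorithm, just a stand-in."""
    for suffix in ("ing", "es", "s"):
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term


def index_terms(text: str, stem: bool, remove_stopwords: bool) -> List[str]:
    """Return the terms that would be indexed under the chosen settings."""
    terms = text.lower().split()
    if remove_stopwords:
        terms = [t for t in terms if t not in STOPWORDS]
    if stem:
        terms = [crude_stem(t) for t in terms]
    return terms


passage = "The dogs are chasing the foxes"
unstemmed = index_terms(passage, stem=False, remove_stopwords=False)
stemmed = index_terms(passage, stem=True, remove_stopwords=True)
print(unstemmed)  # ['the', 'dogs', 'are', 'chasing', 'the', 'foxes']
print(stemmed)    # ['dog', 'chas', 'fox']
```

An unstemmed index keeps every surface form, so a query for "dog" would not match "dogs" on term identity alone; a stemmed index conflates such forms and is smaller, which is consistent with the index sizes listed on this page.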

terrier_stemmed_docT5query

Last update: 2021-06-12. Size: 2.3GB

Terrier index of passages expanded with docT5query. Porter stemming and stopword removal applied.

Use this for retrieval in PyTerrier:

import pyterrier as pt

bm25_terrier_stemmed_docT5query = pt.BatchRetrieve.from_dataset('msmarco_passage', 'terrier_stemmed_docT5query', wmodel='BM25')
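
As a rough picture of what document expansion does before indexing, the sketch below appends predicted queries to a passage. The function name `expand_passage` and the example queries are hypothetical; in docT5query the queries would come from a T5 model, which is not run here.

```python
from typing import List


def expand_passage(passage: str, predicted_queries: List[str]) -> str:
    """Append predicted queries to the passage text before it is indexed."""
    return " ".join([passage, *predicted_queries])


passage = "Terriers were bred to hunt vermin."
# Hypothetical queries, written by hand for illustration.
predicted = ["what is the purpose of a terrier", "what do terriers hunt"]
expanded = expand_passage(passage, predicted)

# Terms that exist only because of the expansion.
new_terms = set(expanded.lower().split()) - set(passage.lower().split())
print("purpose" in new_terms)  # True
```

The expanded text carries vocabulary such as "purpose" that the original passage lacks, so a lexical model like BM25 can match queries phrased that way.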

terrier_stemmed

Last update: 2021-06-12. Size: 1.4GB

Terrier index with default Porter stemming and stopword removal.

Use this for retrieval in PyTerrier:

import pyterrier as pt

bm25_terrier_stemmed = pt.BatchRetrieve.from_dataset('msmarco_passage', 'terrier_stemmed', wmodel='BM25')

dph_terrier_stemmed = pt.BatchRetrieve.from_dataset('msmarco_passage', 'terrier_stemmed', wmodel='DPH')

terrier_stemmed_deepct

Last update: 2021-06-12. Size: 2.3GB

Terrier index using DeepCT. Porter stemming and stopword removal applied.

Use this for retrieval in PyTerrier:

import pyterrier as pt

bm25_terrier_stemmed_deepct = pt.BatchRetrieve.from_dataset('msmarco_passage', 'terrier_stemmed_deepct', wmodel='BM25')
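
DeepCT is usually described as predicting a context-aware importance weight for each passage term and indexing scaled weights in place of raw term frequencies. The sketch below illustrates only that last step, under stated assumptions: the weights are hand-written rather than predicted by a model, they are assumed to lie in [0, 1], and the scale of 100 and the function name `weights_to_pseudo_tf` are choices made for this example.

```python
from collections import Counter
from typing import Dict


def weights_to_pseudo_tf(weights: Dict[str, float], scale: int = 100) -> Counter:
    """Turn term weights in [0, 1] into integer pseudo term frequencies."""
    tf: Counter = Counter()
    for term, weight in weights.items():
        count = round(weight * scale)
        if count > 0:  # terms whose weight rounds to zero drop out of the index
            tf[term] = count
    return tf


# Hypothetical weights for one passage.
weights = {"terrier": 0.42, "hunt": 0.17, "were": 0.001}
pseudo_tf = weights_to_pseudo_tf(weights)
print(dict(pseudo_tf))  # {'terrier': 42, 'hunt': 17}
```

A standard weighting model such as BM25 can then be applied to the index unchanged, since it only sees term frequencies.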

terrier_unstemmed_text

Last update: 2021-06-12. Size: 3.1GB

Terrier index, no stemming, no stopword removal. Text is also saved in the MetaIndex to facilitate BERT-based reranking.

Use this for retrieval in PyTerrier:

import pyterrier as pt

import onir_pt

# Let's use a vanilla BERT ranker from OpenNIR, loading the Capreolus model available from Hugging Face.

vanilla_bert = onir_pt.reranker('hgf4_joint', ranker_config={'model': 'Capreolus/bert-base-msmarco', 'norm': 'softmax-2'})

# First stage: BM25 over the unstemmed index, returning the stored passage text for the reranker.

bm25_terrier_unstemmed_text = pt.BatchRetrieve.from_dataset('msmarco_passage', 'terrier_unstemmed_text', wmodel='BM25', metadata=['docno', 'text'])

# Second stage: rerank the BM25 candidates with BERT.

bm25_bert_terrier_unstemmed_text = bm25_terrier_unstemmed_text >> vanilla_bert