PyTerrier Data Repository

MSMARCO Document Ranking

A document ranking corpus containing 3.2 million documents, also used by the TREC Deep Learning track.

Variants

We have 4 index variants for this dataset:

terrier_stemmed_text

Last Update: 2021-06-13. Size: 8.3GB

Terrier index built with Terrier's default Porter stemming and stopword removal. Document text is also saved in the MetaIndex to facilitate BERT-based reranking.

Use this for retrieval in PyTerrier:

import pyterrier as pt
import onir_pt

# Older PyTerrier releases also need pt.init() before first use.

# Let's use a Vanilla BERT ranker from OpenNIR. We'll use the Capreolus model available from Hugging Face.

vanilla_bert = onir_pt.reranker('hgf4_joint', ranker_config={'model': 'Capreolus/bert-base-msmarco', 'norm': 'softmax-2'})

# BM25 retrieval only.
bm25_terrier_stemmed_text = pt.BatchRetrieve.from_dataset('msmarco_document', 'terrier_stemmed_text', wmodel='BM25')

# BM25 retrieval, then BERT reranking of sliding-window passages built from the stored text.
bm25_bert_terrier_stemmed_text = pt.BatchRetrieve.from_dataset('msmarco_document', 'terrier_stemmed_text', wmodel='BM25', metadata=['docno', 'text']) >> pt.text.sliding(length=128, stride=64, prepend_attr='title') >> vanilla_bert
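The sliding-window step in the pipeline above splits each long document into overlapping passages so that a BERT model with a limited input length can score them. The following is a minimal pure-Python sketch of that idea, not PyTerrier's own implementation: the function name `sliding_passages` and its `title` parameter are hypothetical, tokens are assumed to be whitespace-separated, and the handling of the final window may differ from what `pt.text.sliding` does.

```python
import re


def sliding_passages(text, length=128, stride=64, title=None):
    """Split text into overlapping windows of whitespace-separated tokens.

    Illustrative only: `length` and `stride` mirror the parameter names used
    in the pipeline above, while `title` is a hypothetical stand-in for the
    attribute that gets prepended to every passage.
    """
    stripped = text.strip()
    tokens = re.split(r"\s+", stripped) if stripped else []
    windows = []
    if tokens and len(tokens) <= length:
        # Short documents fit into a single passage.
        windows.append(tokens)
    else:
        for start in range(0, len(tokens), stride):
            windows.append(tokens[start:start + length])
            if start + length >= len(tokens):
                break  # this window already reaches the end of the text
    passages = [" ".join(window) for window in windows]
    if title is not None:
        passages = [f"{title} {passage}" for passage in passages]
    return passages
```

With `length=128` and `stride=64`, consecutive passages share half of their tokens, so a relevant sentence that straddles one window boundary still appears whole in a neighbouring window.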

terrier_unstemmed

Last Update: 2021-06-13. Size: 4.1GB

Terrier index, no stemming, no stopword removal.

Use this for retrieval in PyTerrier:

import pyterrier as pt

bm25_terrier_unstemmed = pt.BatchRetrieve.from_dataset('msmarco_document', 'terrier_unstemmed', wmodel='BM25')

dph_terrier_unstemmed = pt.BatchRetrieve.from_dataset('msmarco_document', 'terrier_unstemmed', wmodel='DPH')

# DPH retrieval, Bo1 query expansion over the top-ranked documents, then DPH again with the expanded query.
dph_bo1_terrier_unstemmed = dph_terrier_unstemmed >> pt.rewrite.Bo1QueryExpansion(pt.get_dataset('msmarco_document').get_index('terrier_unstemmed')) >> dph_terrier_unstemmed
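The last pipeline above retrieves with DPH, expands the query with Bo1, and retrieves again. As a rough illustration of what the expansion step computes, the sketch below weights terms from the top-ranked (feedback) documents with the Bo1 formula, w(t) = tf_x * log2((1 + P_n) / P_n) + log2(1 + P_n), where tf_x is the term's frequency in the feedback documents and P_n is its collection frequency divided by the number of documents. The function names and the toy statistics are hypothetical, and the sketch leaves out the weight normalisation and query-term reweighting that a full implementation would apply.

```python
import math
from collections import Counter


def bo1_weights(feedback_docs, collection_freq, num_docs):
    """Weight every term seen in the feedback documents with Bo1.

    feedback_docs: list of token lists (the top-ranked documents).
    collection_freq: dict mapping term -> occurrences in the whole collection.
    num_docs: number of documents in the collection.
    """
    tf_feedback = Counter(tok for doc in feedback_docs for tok in doc)
    weights = {}
    for term, tf_x in tf_feedback.items():
        p_n = collection_freq[term] / num_docs
        weights[term] = tf_x * math.log2((1 + p_n) / p_n) + math.log2(1 + p_n)
    return weights


def expand_query(query_terms, weights, num_terms=10):
    """Append the highest-weighted feedback terms not already in the query."""
    top = sorted(weights, key=weights.get, reverse=True)[:num_terms]
    return list(query_terms) + [t for t in top if t not in query_terms]


# Toy statistics, invented purely for illustration.
feedback = [["neural", "ranking", "the"], ["neural", "the", "the"]]
stats = {"neural": 50, "ranking": 20, "the": 2000}
weights = bo1_weights(feedback, stats, num_docs=1000)
expanded = expand_query(["ranking"], weights, num_terms=2)
```

Terms that are frequent in the feedback documents but rare in the collection receive the largest weights, which is why the second DPH pass can pick up relevant documents that do not contain the original query terms.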

terrier_stemmed

Last Update: 2021-06-13. Size: 4.1GB

Terrier index built with Terrier's default Porter stemming and stopword removal.

Use this for retrieval in PyTerrier:

import pyterrier as pt

bm25_terrier_stemmed = pt.BatchRetrieve.from_dataset('msmarco_document', 'terrier_stemmed', wmodel='BM25')

dph_terrier_stemmed = pt.BatchRetrieve.from_dataset('msmarco_document', 'terrier_stemmed', wmodel='DPH')

# DPH retrieval, Bo1 query expansion over the top-ranked documents, then DPH again with the expanded query.
dph_bo1_terrier_stemmed = dph_terrier_stemmed >> pt.rewrite.Bo1QueryExpansion(pt.get_dataset('msmarco_document').get_index('terrier_stemmed')) >> dph_terrier_stemmed

terrier_unstemmed_text

Last Update: 2021-06-13. Size: 8.3GB

Terrier index, no stemming, no stopword removal. Text is also saved in the MetaIndex to facilitate BERT-based reranking.

Use this for retrieval in PyTerrier:

import pyterrier as pt
import onir_pt

# Let's use a Vanilla BERT ranker from OpenNIR. We'll use the Capreolus model available from Hugging Face.

vanilla_bert = onir_pt.reranker('hgf4_joint', ranker_config={'model': 'Capreolus/bert-base-msmarco', 'norm': 'softmax-2'})

# BM25 retrieval only.
bm25_terrier_unstemmed_text = pt.BatchRetrieve.from_dataset('msmarco_document', 'terrier_unstemmed_text', wmodel='BM25')

# BM25 retrieval, then BERT reranking of sliding-window passages built from the stored text.
bm25_bert_terrier_unstemmed_text = pt.BatchRetrieve.from_dataset('msmarco_document', 'terrier_unstemmed_text', wmodel='BM25', metadata=['docno', 'text']) >> pt.text.sliding(length=128, stride=64, prepend_attr='title') >> vanilla_bert
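The reranking pipeline above scores passages rather than whole documents, so a document-level ranking needs one more aggregation step. PyTerrier provides `pt.text.max_passage()` for this purpose. The sketch below shows the underlying idea in plain Python under one assumption that should be checked against the identifiers your pipeline actually emits: that each passage identifier has the form `docno%pN`. The function name `max_passage_scores` is hypothetical.

```python
def max_passage_scores(passage_scores):
    """Collapse passage scores to one score per document, keeping the maximum.

    passage_scores: iterable of (passage_id, score) pairs, where passage_id is
    assumed to look like 'D123%p4' (document number, then passage index).
    Returns (docno, score) pairs sorted by descending score.
    """
    best = {}
    for passage_id, score in passage_scores:
        docno = passage_id.split("%p")[0]
        if docno not in best or score > best[docno]:
            best[docno] = score
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


# Toy passage scores, invented purely for illustration.
ranking = max_passage_scores([("D1%p0", 0.2), ("D1%p1", 0.9), ("D2%p0", 0.5)])
```

Taking the maximum ranks a document by its best-matching passage. Other aggregations, such as the mean or the first passage only, are possible and trade robustness against sensitivity to a single strong passage.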