PyTerrier demonstration for trec-covid

This notebook demonstrates retrieval using PyTerrier on the TREC COVID corpus.

About the corpus: a collection of scientific articles related to COVID-19. This notebook uses the 2020-07-16 version of CORD-19, which is the version used by the TREC COVID Complete benchmark.

In [1]:
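#!pip install -q python-terrier
import pyterrier as pt

# Start Terrier once per session; skip if it is already running.
if not pt.started():
    pt.init()

# Bring the evaluation measures (nDCG, P, AP, ...) into scope for the experiment below.
from pyterrier.measures import *
dataset = pt.get_dataset('trec-covid')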
        
PyTerrier 0.7.0 has loaded Terrier 5.6 (built by craigmacdonald on 2021-09-17 13:27)

Systems using index variant terrier_stemmed

This index applies Terrier's default Porter stemmer and removes stopwords.

In [2]:
# BM25 and DPH retrieval over the stemmed index.
bm25_terrier_stemmed = pt.BatchRetrieve.from_dataset('trec-covid', 'terrier_stemmed', wmodel='BM25')

dph_terrier_stemmed = pt.BatchRetrieve.from_dataset('trec-covid', 'terrier_stemmed', wmodel='DPH')

# DPH retrieval, Bo1 query expansion from the retrieved documents, then DPH again with the expanded query.
dph_bo1_terrier_stemmed = dph_terrier_stemmed >> pt.rewrite.Bo1QueryExpansion(pt.get_dataset('trec-covid').get_index('terrier_stemmed')) >> dph_terrier_stemmed

Systems using index variant terrier_unstemmed

This Terrier index applies no stemming and no stopword removal.

In [3]:
# BM25 and DPH retrieval over the unstemmed index.
bm25_terrier_unstemmed = pt.BatchRetrieve.from_dataset('trec-covid', 'terrier_unstemmed', wmodel='BM25')

dph_terrier_unstemmed = pt.BatchRetrieve.from_dataset('trec-covid', 'terrier_unstemmed', wmodel='DPH')

# DPH retrieval, Bo1 query expansion from the retrieved documents, then DPH again with the expanded query.
dph_bo1_terrier_unstemmed = dph_terrier_unstemmed >> pt.rewrite.Bo1QueryExpansion(pt.get_dataset('trec-covid').get_index('terrier_unstemmed')) >> dph_terrier_unstemmed

Evaluation on TREC COVID Complete topics and qrels

The 50 topics from the TREC COVID task, with deep relevance judgements. This evaluation uses the natural-language "description" queries.

In [4]:
pt.Experiment(
    [bm25_terrier_stemmed, dph_terrier_stemmed, dph_bo1_terrier_stemmed, bm25_terrier_unstemmed, dph_terrier_unstemmed, dph_bo1_terrier_unstemmed],
    pt.get_dataset('irds:cord19/trec-covid').get_topics('description'),
    pt.get_dataset('irds:cord19/trec-covid').get_qrels(),
    batch_size=200,
    filter_by_qrels=True,
    eval_metrics=[nDCG@10, P@5, P(rel=2)@5, AP],
    names=['bm25_terrier_stemmed', 'dph_terrier_stemmed', 'dph_bo1_terrier_stemmed', 'bm25_terrier_unstemmed', 'dph_terrier_unstemmed', 'dph_bo1_terrier_unstemmed'])
        
Out[4]:
   name                       nDCG@10   P@5    P(rel=2)@5  AP
0  bm25_terrier_stemmed       0.624045  0.716  0.592       0.223192
1  dph_terrier_stemmed        0.642340  0.720  0.608       0.196439
2  dph_bo1_terrier_stemmed    0.680393  0.744  0.636       0.235759
3  bm25_terrier_unstemmed     0.472546  0.560  0.464       0.118504
4  dph_terrier_unstemmed      0.551274  0.648  0.536       0.126591
5  dph_bo1_terrier_unstemmed  0.620233  0.700  0.592       0.176572
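
As a reading aid for the metric columns above, the following sketch computes P@k, P(rel=2)@k, nDCG@k and AP for a single query in plain Python. It is an illustration, not the code that `pt.Experiment` runs. The relevance grades in `ranked` and `judged` are made up. The nDCG form used here (gain equal to the relevance grade, discount of log2(rank + 1)) is an assumption about a common formulation and may differ in detail from what the measure library computes. The figures in the table are averages over the 50 topics.

```python
import math


def precision_at_k(ranked_rels, k, min_rel=1):
    """Fraction of the top-k results whose relevance grade is at least min_rel."""
    return sum(1 for r in ranked_rels[:k] if r >= min_rel) / k


def dcg_at_k(rels, k):
    """Discounted cumulative gain: grade divided by log2(rank + 1), summed over the top k."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(rels[:k], start=1))


def ndcg_at_k(ranked_rels, judged_rels, k):
    """DCG of the ranking divided by the DCG of an ideal ordering of all judged documents."""
    ideal = dcg_at_k(sorted(judged_rels, reverse=True), k)
    return dcg_at_k(ranked_rels, k) / ideal if ideal > 0 else 0.0


def average_precision(ranked_rels, num_relevant, min_rel=1):
    """Mean of the precision values at each rank where a relevant document appears."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(ranked_rels, start=1):
        if rel >= min_rel:
            hits += 1
            total += hits / rank
    return total / num_relevant if num_relevant else 0.0


# Hypothetical grades for one query: 0 = not relevant, 1 = partially relevant, 2 = relevant.
ranked = [2, 0, 1, 2, 0]   # grades of the returned documents, in rank order
judged = [2, 2, 1, 0, 0]   # grades of every judged document for the query

print(precision_at_k(ranked, 5))             # 0.6
print(precision_at_k(ranked, 5, min_rel=2))  # 0.4
print(ndcg_at_k(ranked, judged, 5))
print(average_precision(ranked, num_relevant=3))
```

P(rel=2)@5 can never exceed P@5 because it counts only documents graded 2 or higher, which matches the pattern in every row of the table.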