We have been using BioReader to successfully separate article corpora in the following use cases:
Example 1: Classifying articles by the presence or absence of flow cytometry data measuring the expression of surface proteins. For this project, we manually curated 508 articles and separated these into the two categories, one containing the desired information, and the other not containing the information. We then searched PubMed for articles to classify by searching for different cell types and proteins of interest, e.g. “T cell [AND] CD8”. For more information, see Barnkob MS, Simon C, Olsen LR: Characterizing the human hematopoietic CDome. Frontiers in Genetics, 2014.
Example 2: Classifying articles by the presence or absence of reported tumor specific T cell epitopes. For this project, we manually curated 512 articles (256 in each category) and used the resulting classifier to classify articles from which we extracted tumor T cell epitopes for the TANTIGEN database. For more information, see Olsen LR, Kudahl UJ, Winther O, Brusic V: Literature classification for semi-automated updating of biological knowledgebases. BMC Genomics, 2013.
Example 3: Classifying articles by epitope-specific data or epitope structure. This data was used for the Immune Epitope Database. Set 1 contains 1,000 abstracts for which the articles contain epitope-specific data or epitope and 1,000 abstracts not containing this information. For more information, see Vita R, Peters B, Sette A: The curation guidelines of the Immune Epitope Database and Analysis Resource. Cytometry A, 2008.
To use BioReader, you need two data sets:
Before using BioReader you must prepare your datasets. The training data must be manually curated and separated into (at least) two classes. For these articles, you must fetch PubMed IDs and copy/paste these into the relevant input box, separated by comma or a new line. Collecting these IDs as you curate your articles is by far the easiest approach, but they can be fetched after the curation using the Entrez Direct E-utilities. The abstracts to be classified can be collected by searching PubMed for articles and saving the PubMed IDs (Figure 1), or, they can be fetched by entering a PubMed search term directly in BioReader. Please note that due to NCBI limitations, it is currently only possible to enter 1000 abstracts in each training category and the test list. We are working on increasing this number.
The results page is divided into three main parts:
The performance of BioReader depends heavily on the size of the training set, how well the training set captures the differences between classes, and the inherent ability of a given set to be separated into classes. We recommend at least 20 manually curated abstracts in each category, and ideally more than 100. We conducted a test on two relatively homologous groups of abstracts – all within the field of immunology, but some containing epitope information, and others not. Iteratively training a classifier and slowly increasing the training set size, we found that a lot is gained when increasing from 50-250 abstracts, but beyond that, performance gain is minimal (Figure 2).
BioReader can help you distinguish between relevant and irrelevant literature - even in cases where it is not obvious upon first glance. In fact, the more similar your negative articles are to your positive, the better the results - “subtle differences” is our speciality, and you probably don’t really need BioReader's help to tell the difference between psychology papers and plant science papers.
Similarly, it is essential that you keep the training set balanced (an equal number of abstracts in each training category). Ideally, a training sets will be in a 50/50 ratio between the two categories, since an over representation of either category will skew the classification. Adding more abstracts in one category will increase the recall of that category at the expense of the precision (Figure 3).
Lastly, it is important that you only attempt to separate into two categories in each run. It is possible to further categorize resulting categories in an iterative manner, e.g. first classify on diseases categories, followed by an additional classification on, say, the host organism.
Step 1. Abstract retrieval. The webserver offers a simple interface where users are prompted to upload three lists of PubMed IDs: two lists for the training categories (e.g. positive and negative for content of interest) as well as one PubMed search term for which abstracts will be classified, and/or a list of PubMed IDs corresponding to abstracts to be classified as belonging to one of the two groups. The abstracts are retrieved using NCBI’s Entrez programming utilities, E-utilities.
Step 2. Text pre-processing and corpus formation. Once abstracts are retrieved, the three text corpora are formed and the following operations are performed on the text: lowercase transformation, stop word removal, punctuation removal, word stemming, and whitespace stripping. As many gene names contain numeric characters, numbers are not removed. All of the above operations are performed using the “NLP” and “tm” packages for R.
Step 3. Document-term matrix formation and classifier training. After corpus formation, the texts are tokenized in document term matrices (DTM) and word counts are background corrected by term frequency-inverse document frequency (Tf-Idf) transformation (using the “tm” package for R). Terms in the DTMs are then reduced to the top 100 terms differentiating the two training classes, as determined by a Welch’s T test. The resulting training corpora DTMs are used to train and test ten different classification algorithms (support vector machine, elastic-net regularized generalized linear model, maximum entropy, scaled linear discriminant analysis, bagging, boosting, random forest, artificial neural network, decision tree, and naïve Bayes classifiers) available through the “RTextTools” and “e1071” (Naïve Bayes) packages for R. The best performing algorithm is determined by five-fold cross-validation on the training set and the documents to be classified are subsequently assigned one of the two classes using this algorithm.
BioReader is written in R with a Perl wrapper for fetching abstracts from PubMed. The source code for BioReader is available upon request. The script can be run from command line or within R, if you already have a set of abstracts downloaded.