About Sysrev's prediction model

Sysrev will make predictions about the likely label answer for all Boolean and categorical labels (including the default include/exclude label) based on existing label answers provided by human reviewers in a project. In this way, Sysrev supports prioritized screening such that reviewers can review records based on their likelihood of inclusion or exclusion in a review, speeding up the review process and allowing for bulk exclusion of records should review teams choose to take that approach.
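As a rough illustration of prioritized screening, the sketch below orders records by a predicted inclusion probability and lists those that fall below a cutoff. The record IDs, the probabilities, the 0.05 threshold, and the function names are all hypothetical; this is not Sysrev's code or API.

```python
# Hypothetical predicted inclusion probabilities keyed by record ID.
predicted_inclusion = {"rec-1": 0.12, "rec-2": 0.91, "rec-3": 0.48, "rec-4": 0.03}


def prioritize(predictions):
    """Return record IDs ordered from most to least likely to be included."""
    return sorted(predictions, key=predictions.get, reverse=True)


def below_threshold(predictions, threshold):
    """Return record IDs whose predicted inclusion is below the threshold."""
    return sorted(r for r, p in predictions.items() if p < threshold)


# Reviewers would work through this queue from the top.
queue = prioritize(predicted_inclusion)

# Records a team might consider for bulk exclusion (threshold is an assumption).
bulk_exclusion_candidates = below_threshold(predicted_inclusion, 0.05)
```

Whether and where to set such a cutoff is a decision for the review team; the value here is only a placeholder.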

Sysrev uses a Stochastic Gradient Descent (SGD) model for prediction. SGD is an iterative optimization algorithm that updates model parameters using the gradient computed from a single randomly selected data point or a small batch. This model can be used for multi-label classification (such as categorical labels) and is fast and scalable.
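To make the SGD update concrete, here is a minimal sketch of a yes/no classifier trained one example at a time. It assumes a logistic (log loss) model and two made-up numeric features per record; the function names, learning rate, and data are illustrative assumptions, and Sysrev's actual features, loss function, and settings are not described here.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(weights, bias, features):
    """Predicted probability that the label answer is yes."""
    return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)


def sgd_logistic(examples, epochs=200, lr=0.1, seed=0):
    """Fit weights and a bias, updating after every single example."""
    rng = random.Random(seed)
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    data = list(examples)
    for _ in range(epochs):
        rng.shuffle(data)  # visit the examples in a new random order each pass
        for features, label in data:
            # Gradient of the log loss with respect to the raw score.
            error = predict_proba(weights, bias, features) - label
            weights = [w - lr * error * x for w, x in zip(weights, features)]
            bias -= lr * error
    return weights, bias


# Toy data: (feature vector, reviewer answer) with 1 = include, 0 = exclude.
toy = [([1.0, 0.0], 1), ([0.9, 0.1], 1), ([0.0, 1.0], 0), ([0.1, 0.9], 0)]
weights, bias = sgd_logistic(toy)
```

Because each update touches only one example, the cost per step stays constant as the number of labeled records grows, which is the reason this family of methods scales well.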

How and when the model runs

In general, Sysrev automatically runs the prediction model every hour once sufficient reviewer labels have been added to a project. Predictions are not generated if there is not enough class coverage (for example, if you have only ever marked a label true, or if some categorical values have never been used). As soon as there is at least one reviewer answer for every possible answer in a given label, the model will generate predictions for that label. That said, the model needs sufficient data before it can make good predictions: the more human reviewer data it has, the better the results.
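The class coverage rule described above can be sketched as a simple set check. The function name and the example answers are hypothetical and only illustrate the condition; they are not taken from Sysrev.

```python
def has_class_coverage(possible_answers, reviewer_answers):
    """True when every possible answer has been given at least once."""
    return set(possible_answers) <= set(reviewer_answers)


# Boolean label where reviewers have only ever answered True: no coverage.
only_true = has_class_coverage([True, False], [True, True, True])

# Boolean label with both answers present: coverage is met.
both_seen = has_class_coverage([True, False], [True, False, True])

# Categorical label where one value ("case report") was never used.
study_types = ["RCT", "cohort", "case report"]
partial = has_class_coverage(study_types, ["RCT", "cohort", "RCT"])
```

Meeting this condition only means predictions can be generated; it says nothing about how reliable they are with so few answers.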

For categorical labels, the prediction model treats each category as its own yes/no Boolean label. In other words, for each category option in a categorical label, it predicts how likely an article is to be labeled with that category.
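The per-category treatment can be pictured as splitting one categorical label into several Boolean targets, one per category. The sketch below assumes an article's answer is a set of selected categories; the category names, the data, and the function name are hypothetical.

```python
def to_boolean_targets(categories, article_answers):
    """Map each category to a yes/no answer per article."""
    return {
        category: [category in answers for answers in article_answers]
        for category in categories
    }


categories = ["RCT", "cohort", "case report"]

# One set of selected categories per article, in article order.
article_answers = [{"RCT"}, {"cohort", "case report"}, {"RCT", "cohort"}]

targets = to_boolean_targets(categories, article_answers)
```

Each list in `targets` could then be used to train a separate yes/no model, giving one predicted likelihood per category for every article.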