Appendix B: Predictive Algorithms’ Performance on Holdout Test Data
This appendix provides a comprehensive overview of our predictive algorithms’ performance on every field in this study. For the relevance field (i.e., whether qualified immunity was raised on appeal), we calculated performance statistics based on the entire 187-opinion testing sample. For all other fields, we calculated performance statistics based on the 162 relevant opinions from the testing sample. We did not use these holdout test data at any point during the algorithm-development process. This ensured that the algorithms’ performance on these opinions was representative of their performance overall.
To measure performance, we used the following statistics:
- Accuracy: Accuracy measures the number of correct predictions out of the total number of predictions for a field. For example, if an algorithm predicts a field 100 times and 91 predictions are correct, then the accuracy is 91%. Accuracy is our primary metric for evaluating text and categorical fields. Although we report accuracy for all predicted fields, other metrics—precision, recall, and F1 score—are better indicators of performance for binary (yes/no) fields.
- Precision: In a nutshell, precision measures how good an algorithm is at avoiding false positives. For example, if an algorithm predicts 20 opinions involve state law enforcement officials, but only 18 actually do (meaning two were false positives), then the precision is 0.9 (18/20). The maximum value is 1. Precision is applicable only for binary fields.
- Recall: Put simply, recall measures how good an algorithm is at picking out the field in question. For example, if there are 25 true interlocutory appeals and the algorithm correctly identifies 23 of them, then recall is 0.92 (23/25). The maximum value is 1. Recall is applicable only for binary fields.
- F1 Score: The F1 score is a widely used performance metric that combines precision and recall into a single statistic by taking their harmonic mean. A harmonic mean penalizes divergence between the values being averaged: the harmonic mean of 0.8 and 0.8 is 0.8, but the harmonic mean of 0.6 and 1 is only 0.75, because 0.6 and 1 diverge more than 0.8 and 0.8. A good F1 score indicates that both precision and recall performed well. The maximum value is 1.
- Confusion Matrix: We also provide confusion matrices, which give additional context for an algorithm’s performance. Although not technically a statistic, a confusion matrix compares an algorithm’s predictions to the true values by putting them into a table. In our matrices, values shaded in green are correct predictions—that is, predictions that matched the true value. Blue cells are false positives (predicted positives but true negatives). Yellow cells are false negatives (predicted negatives but true positives), although for categorical variables, all incorrect cells are shaded yellow. Qualitative fields (e.g., plaintiffs, defendants) do not have confusion matrices.
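To make these definitions concrete, the following is a minimal Python sketch (using hypothetical labels, not the study's actual code or data) that computes all four statistics and a confusion matrix for a binary field:

```python
# Hypothetical true values and predictions for a binary field
truth = [1, 1, 1, 1, 0, 0, 0, 1, 0, 1]
preds = [1, 1, 0, 1, 0, 1, 0, 1, 0, 1]

tp = sum(t == 1 and p == 1 for t, p in zip(truth, preds))  # true positives
fp = sum(t == 0 and p == 1 for t, p in zip(truth, preds))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(truth, preds))  # false negatives
tn = sum(t == 0 and p == 0 for t, p in zip(truth, preds))  # true negatives

accuracy = (tp + tn) / len(truth)                   # correct / total
precision = tp / (tp + fp)                          # avoids false positives
recall = tp / (tp + fn)                             # finds the true positives
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

# Confusion matrix with truth on the y-axis and prediction on the x-axis
confusion = [[tn, fp],
             [fn, tp]]
```

With these ten hypothetical labels, the sketch yields an accuracy of 0.8 and precision, recall, and F1 of roughly 0.833; note that precision and recall divide only by predicted and actual positives, respectively, which is why they react to false positives and false negatives differently than accuracy does.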
| Field | Type | Algorithm Prediction Method | Accuracy | Precision | Recall | F1 Score | Confusion Matrix (Truth: y-axis; Prediction: x-axis) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| **Basic Information** | | | | | | | |
| Relevance (qualified immunity raised on appeal) | Binary | Rules-Model Hybrid | 90.9% | 0.980 | 0.914 | 0.95 | |
| Circuit Court | Text | Rules-Based Extraction | 100% | — | — | — | — |
| Circuit Court Case Number | Text | Rules-Based Extraction | 99.4% | — | — | — | — |
| Opinion Date | Text (date) | Rules-Based Extraction | 100% | — | — | — | — |
| Plaintiffs | Text | Rules-Based Extraction | 99.4% | — | — | — | — |
| Defendants | Text | Rules-Based Extraction | 100% | — | — | — | — |
| Judges | Text | Rules-Based Extraction | 100% | — | — | — | — |
| District Court of Origin | Text | Rules-Based Extraction | 99.4% | — | — | — | — |
| District Court Case Number | Text | Manually Coded | — | — | — | — | — |
| Case Origination Date | Text (date) | Manually Coded | — | — | — | — | — |
| **Procedural Details** | | | | | | | |
| Appellants | Categorical (P – Plaintiffs; B – Both (cross-appellants); D – Defendants) | Rules-Based Prediction | 99.4% | — | — | — | |
| Published | Binary | Rules-Based Prediction | 100% | 1.000 | 1.000 | 1.00 | |
| En Banc | Binary | Rules-Based Prediction | 100% | 1.000 | 1.000 | 1.00 | |
| Interlocutory Appeal | Binary | Rules-Based Prediction | 99.4% | 1.000 | 0.979 | 0.99 | |
| Pro Se Plaintiffs (self-represented plaintiffs) | Categorical (0 – No pro se in lawsuit; ES – Pro se at earlier stage; 1 – All pro se in appeal) | Rules-Based Prediction | 99.4% | — | — | — | |
| Case Stage (at time of appeal)\*\* | Categorical (SJ – Summary Judgment; D – Dismissal; B – Both; PT – Post-trial) | Rules-Model Hybrid | 94.4% | — | — | — | |
| **Government Defendant Type** | | | | | | | |
| Government Level of Defendants | Categorical (S – State; B – Both; F – Federal) | Rules-Based Prediction | 96.9% | — | — | — | |
| State Law Enforcement Defendants | Binary | Rules-Model Hybrid | 95.1% | 0.990 | 0.932 | 0.96 | |
| Federal Law Enforcement Defendants | Binary | Rules-Model Hybrid | 99.4% | 0.000 | –\* | –\* | |
| State Prison Defendants | Binary | Rules-Model Hybrid | 97.5% | 1.000 | 0.875 | 0.93 | |
| Federal Prison Defendants | Binary | Rules-Model Hybrid | 99.4% | 1.000 | 0.750 | 0.86 | |
| Other Defendants | Binary | Rules-Based Prediction | 90.7% | 0.826 | 0.844 | 0.84 | |
| Task Force Defendants | Binary | Rules-Based Prediction | 100% | –\* | –\* | –\* | |
| **Constitutional Violation Type** | | | | | | | |
| First Amendment Violations | Binary | Rules-Model Hybrid | 99.4% | 1.000 | 0.969 | 0.98 | |
| Religious Liberty Violations | Binary | Rules-Model Hybrid | 100% | 1.000 | 1.000 | 1.00 | |
| Excessive Force Violations | Binary | Rules-Model Hybrid | 98.1% | 0.980 | 0.961 | 0.97 | |
| False Arrest Violations | Binary | Rules-Model Hybrid | 91.4% | 0.975 | 0.750 | 0.85 | |
| Illegal Search Violations | Binary | Rules-Model Hybrid | 92.6% | 0.714 | 0.714 | 0.71 | |
| Procedural Due Process Violations | Binary | Rules-Model Hybrid | 96.9% | 0.737 | 1.000 | 0.85 | |
| Care in Custody Violations | Binary | Model-Based Prediction | 95.7% | 0.941 | 0.727 | 0.82 | |
| Parental Rights Violations | Binary | Model-Based Prediction | 99.4% | 1.000 | 0.800 | 0.89 | |
| Employment Violations | Binary | Model-Based Prediction | 95.7% | 1.000 | 0.682 | 0.81 | |
| **Outcomes** | | | | | | | |
| Overall Prevailing Party | Categorical (P – Plaintiffs; M – Mixed; D – Defendants) | Rules-Model Hybrid | 96.3% | — | — | — | |
| Qualified Immunity Granted | Binary | Rules-Based Prediction | 90.1% | 0.914 | 0.914 | 0.91 | |
| Qualified Immunity Denied | Binary | Rules-Based Prediction | 92.6% | 0.902 | 0.822 | 0.86 | |
| Lack of Jurisdiction – Factual Dispute | Binary | Rules-Based Prediction | 98.1% | 1.000 | 0.667 | 0.80 | |
*Statistic cannot be calculated due to insufficient data.
**For the “case stage” field, the confusion matrix does not reflect the “other” category as it did not appear as either an actual or a predicted case stage.