Appendix B: Predictive Algorithms’ Performance on Holdout Test Data
This appendix provides a comprehensive overview of our predictive algorithms’ performance on every field in this study. For the relevance field (i.e., whether qualified immunity was raised on appeal), we calculated performance statistics based on the entire 187-opinion testing sample. For all other fields, we calculated performance statistics based on the 162 relevant opinions from the testing sample. We did not use these holdout test data at any point during the algorithm-development process. This ensured that the algorithms’ performance on these opinions was representative of their performance overall.
To measure performance, we used the following statistics:
- Accuracy: Accuracy measures the number of correct predictions out of the total number of predictions for a field. For example, if an algorithm predicts a field 100 times and 91 predictions are correct, then the accuracy is 91%. Accuracy is our primary metric for evaluating text and categorical fields. Although we report accuracy for all predicted fields, other metrics—precision, recall, and F1 score—are better indicators of performance for binary (yes/no) fields.
- Precision: In a nutshell, precision measures how good an algorithm is at avoiding false positives. For example, if an algorithm predicts 20 opinions involve state law enforcement officials, but only 18 actually do (meaning two were false positives), then the precision is 0.9 (18/20). The maximum value is 1. Precision is applicable only for binary fields.
- Recall: Put simply, recall measures how good an algorithm is at picking out the field in question. For example, if there are 25 true interlocutory appeals and the algorithm correctly identifies 23 of them, then recall is 0.92 (23/25). The maximum value is 1. Recall is applicable only for binary fields.
- F1 Score: The F1 score is a widely used performance metric that combines precision and recall into a single statistic by taking their harmonic mean. A harmonic mean penalizes divergence between the values being averaged: the harmonic mean of 0.8 and 0.8 is 0.8, but the harmonic mean of 0.6 and 1 is only 0.75, because 0.6 and 1 diverge more than 0.8 and 0.8. A good F1 score indicates that both precision and recall performed well. The maximum value is 1.
- Confusion Matrix: We also provide confusion matrices, which give additional context for an algorithm’s performance. Although not technically a statistic, a confusion matrix compares an algorithm’s predictions to the true values by putting them into a table. In our matrices, values shaded in green are correct predictions—that is, predictions that matched the true value. Blue cells are false positives (predicted positives but true negatives). Yellow cells are false negatives (predicted negatives but true positives), although for categorical variables, all incorrect cells are shaded yellow. Qualitative fields (e.g., plaintiffs, defendants) do not have confusion matrices.
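To make these definitions concrete, the following is a minimal Python sketch (using hypothetical labels, not the study's actual code or data) that computes all four statistics and a confusion matrix for a binary field:

```python
# Hypothetical true values and predictions for a binary field
truth = [1, 1, 1, 1, 0, 0, 0, 1, 0, 1]
preds = [1, 1, 0, 1, 0, 1, 0, 1, 0, 1]

tp = sum(t == 1 and p == 1 for t, p in zip(truth, preds))  # true positives
fp = sum(t == 0 and p == 1 for t, p in zip(truth, preds))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(truth, preds))  # false negatives
tn = sum(t == 0 and p == 0 for t, p in zip(truth, preds))  # true negatives

accuracy = (tp + tn) / len(truth)                   # correct / total
precision = tp / (tp + fp)                          # avoids false positives
recall = tp / (tp + fn)                             # finds the true positives
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

# Confusion matrix with truth on the y-axis and prediction on the x-axis
confusion = [[tn, fp],
             [fn, tp]]
```

With these ten hypothetical labels, the sketch yields an accuracy of 0.8 and precision, recall, and F1 of roughly 0.833; note that precision and recall divide only by predicted and actual positives, respectively, which is why they react to false positives and false negatives differently than accuracy does.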
| Field | Type | Algorithm Prediction Method | Accuracy | Precision | Recall | F1 Score | Confusion Matrix (Truth: y-axis; Prediction: x-axis) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| **Basic Information** | | | | | | | |
| Relevance (qualified immunity raised on appeal) | Binary | Rules-Model Hybrid | 90.9% | 0.980 | 0.914 | 0.95 | |
| Circuit Court | Text | Rules-Based Extraction | 100% | — | — | — | — |
| Circuit Court Case Number | Text | Rules-Based Extraction | 99.4% | — | — | — | — |
| Opinion Date | Text (date) | Rules-Based Extraction | 100% | — | — | — | — |
| Plaintiffs | Text | Rules-Based Extraction | 99.4% | — | — | — | — |
| Defendants | Text | Rules-Based Extraction | 100% | — | — | — | — |
| Judges | Text | Rules-Based Extraction | 100% | — | — | — | — |
| District Court of Origin | Text | Rules-Based Extraction | 99.4% | — | — | — | — |
| District Court Case Number | Text | Manually Coded | — | — | — | — | — |
| Case Origination Date | Text (date) | Manually Coded | — | — | — | — | — |
| **Procedural Details** | | | | | | | |
| Appellants | Categorical (P – Plaintiffs; B – Both (cross-appellants); D – Defendants) | Rules-Based Prediction | 99.4% | — | — | — | |
| Published | Binary | Rules-Based Prediction | 100% | 1.000 | 1.000 | 1.00 | |
| En Banc | Binary | Rules-Based Prediction | 100% | 1.000 | 1.000 | 1.00 | |
| Interlocutory Appeal | Binary | Rules-Based Prediction | 99.4% | 1.000 | 0.979 | 0.99 | |
| Pro Se Plaintiffs (self-represented plaintiffs) | Categorical (0 – No pro se in lawsuit; ES – Pro se at earlier stage; 1 – All pro se in appeal) | Rules-Based Prediction | 99.4% | — | — | — | |
| Case Stage (at time of appeal)\*\* | Categorical (SJ – Summary Judgment; D – Dismissal; B – Both; PT – Post-trial) | Rules-Model Hybrid | 94.4% | — | — | — | |
| **Government Defendant Type** | | | | | | | |
| Government Level of Defendants | Categorical (S – State; B – Both; F – Federal) | Rules-Based Prediction | 96.9% | — | — | — | |
| State Law Enforcement Defendants | Binary | Rules-Model Hybrid | 95.1% | 0.990 | 0.932 | 0.96 | |
| Federal Law Enforcement Defendants | Binary | Rules-Model Hybrid | 99.4% | 0.000 | –\* | –\* | |
| State Prison Defendants | Binary | Rules-Model Hybrid | 97.5% | 1.000 | 0.875 | 0.93 | |
| Federal Prison Defendants | Binary | Rules-Model Hybrid | 99.4% | 1.000 | 0.750 | 0.86 | |
| Other Defendants | Binary | Rules-Based Prediction | 90.7% | 0.826 | 0.844 | 0.84 | |
| Task Force Defendants | Binary | Rules-Based Prediction | 100% | –\* | –\* | –\* | |
| **Constitutional Violation Type** | | | | | | | |
| First Amendment Violations | Binary | Rules-Model Hybrid | 99.4% | 1.000 | 0.969 | 0.98 | |
| Religious Liberty Violations | Binary | Rules-Model Hybrid | 100% | 1.000 | 1.000 | 1.00 | |
| Excessive Force Violations | Binary | Rules-Model Hybrid | 98.1% | 0.980 | 0.961 | 0.97 | |
| False Arrest Violations | Binary | Rules-Model Hybrid | 91.4% | 0.975 | 0.750 | 0.85 | |
| Illegal Search Violations | Binary | Rules-Model Hybrid | 92.6% | 0.714 | 0.714 | 0.71 | |
| Procedural Due Process Violations | Binary | Rules-Model Hybrid | 96.9% | 0.737 | 1.000 | 0.85 | |
| Care in Custody Violations | Binary | Model-Based Prediction | 95.7% | 0.941 | 0.727 | 0.82 | |
| Parental Rights Violations | Binary | Model-Based Prediction | 99.4% | 1.000 | 0.800 | 0.89 | |
| Employment Violations | Binary | Model-Based Prediction | 95.7% | 1.000 | 0.682 | 0.81 | |
| **Outcomes** | | | | | | | |
| Overall Prevailing Party | Categorical (P – Plaintiffs; M – Mixed; D – Defendants) | Rules-Model Hybrid | 96.3% | — | — | — | |
| Qualified Immunity Granted | Binary | Rules-Based Prediction | 90.1% | 0.914 | 0.914 | 0.91 | |
| Qualified Immunity Denied | Binary | Rules-Based Prediction | 92.6% | 0.902 | 0.822 | 0.86 | |
| Lack of Jurisdiction – Factual Dispute | Binary | Rules-Based Prediction | 98.1% | 1.000 | 0.667 | 0.80 | |
*Statistic cannot be calculated due to insufficient data.
**For the “case stage” field, the confusion matrix does not reflect the “other” category as it did not appear as either an actual or a predicted case stage.