Appendix A: Developing the Predictive Algorithms
In this appendix, we describe our process for developing the predictive algorithms, beginning with how we turned the opinions into data. We then describe how we built the algorithms and conclude with how we ran the final algorithms on all 7,173 opinions in our dataset.
Turning Opinions Into Data
Before we could build our predictive algorithms, we needed to turn the opinion text into data the algorithms could manipulate. To do this, we imported, indexed, and annotated the text of the opinions as described below.
Importing the Opinions
After downloading the opinions (in Rich Text Format) from Westlaw into a local directory, we read them into a Python 3.7 development environment using the striprtf Python library. 1 One small complication was that Westlaw frequently stores important information as hyperlinks, which the striprtf reader was unable to parse. To fix this, we wrote a short function that converted the hyperlinks into plain text.
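The hyperlink conversion step can be illustrated with a short sketch. It is an illustration under stated assumptions, not the function used in this project: it assumes each hyperlink is stored as a flat RTF field group of the form `{\field{\*\fldinst HYPERLINK "..."}{\fldrslt visible text}}` with no nested braces, and the name `flatten_hyperlinks` is hypothetical.

```python
import re

# Assumed shape of a hyperlink in the raw RTF (flat, no nested groups):
#   {\field{\*\fldinst HYPERLINK "url"}{\fldrslt visible text}}
HYPERLINK_FIELD = re.compile(
    r'\{\\field\s*\{\\\*\\fldinst\s+HYPERLINK\s+"[^"]*"\s*\}'
    r'\s*\{\\fldrslt\s+([^{}]*)\}\s*\}'
)


def flatten_hyperlinks(rtf):
    """Replace each hyperlink field with its visible text only."""
    return HYPERLINK_FIELD.sub(lambda match: match.group(1), rtf)


raw = (
    r'See {\field{\*\fldinst HYPERLINK "https://example.com/doc"}'
    r'{\fldrslt 42 U.S.C. 1983}} for details.'
)
flattened = flatten_hyperlinks(raw)
```

The flattened string could then be passed to an RTF reader such as striprtf. Real Westlaw files may nest additional formatting groups inside the field, which this simplified pattern would not match.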
At the end of the importing process, we were left with an unorganized block of text like the example shown in Figure A1.
Figure A1: Initial Format of an Imported Opinion
© 2021 Thomson Reuters.
Hierarchical Indexing
Because these text blocks were difficult to use and manipulate, our next step was to organize them using hierarchical indexing. Once we successfully read an opinion into our development environment, we structured it into an ordered hierarchy that consisted of sections, paragraphs, sentences, and words. 2 Through this structuring, we assigned each word in the opinion a unique index or location—for example, a word might be the fifth word in the first sentence of the third paragraph of the introduction section.
For the paragraph, sentence, and word indexing, we used standard methods (e.g., splitting on newline characters) or pre-programmed functions from the Natural Language Toolkit (NLTK) library in Python. 3 However, we created the section indexing specific to the structure of the opinions as initially downloaded from Westlaw. It included (among other sections) the header, the Westlaw synopsis/background (if applicable), the main text of the opinion, the footnotes, and any concurrences and dissents (which Westlaw often includes within the same document). In addition, we broke out both the introduction and conclusion of the opinion section using separate subfunctions. We stored each word and its accompanying index in tabular form, as illustrated in Table A1.
Table A1: Example of Indexed Data Stored in Tabular Form
Word | Section | Para Num | Sent Num | Word Num |
---|---|---|---|---|
Qualified | Opinion | 1 | 1 | 1 |
immunity | Opinion | 1 | 1 | 2 |
frequently | Opinion | 1 | 1 | 3 |
involves | Opinion | 1 | 1 | 4 |
alleged | Opinion | 1 | 1 | 5 |
1st | Opinion | 1 | 1 | 6 |
Amendment | Opinion | 1 | 1 | 7 |
violations. | Opinion | 1 | 1 | 8 |
These | Opinion | 1 | 2 | 1 |
violations | Opinion | 1 | 2 | 2 |
are | Opinion | 1 | 2 | 3 |
often | Opinion | 1 | 2 | 4 |
alleged | Opinion | 1 | 2 | 5 |
against | Opinion | 1 | 2 | 6 |
non-police, | Opinion | 1 | 2 | 7 |
non-prison | Opinion | 1 | 2 | 8 |
defendants. | Opinion | 1 | 2 | 9 |
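As a rough illustration of how a word-level table like Table A1 can be produced, the sketch below indexes a block of text by paragraph, sentence, and word. It is a simplified stand-in: it splits sentences with a naive regular expression instead of the NLTK functions mentioned above, takes the section label as a given, and uses the hypothetical name `index_words`.

```python
import re


def index_words(text, section="Opinion"):
    """Return one row per word with its paragraph, sentence, and word number."""
    rows = []
    # Paragraphs: split on newline characters and drop empty lines.
    paragraphs = [p for p in text.split("\n") if p.strip()]
    for para_num, paragraph in enumerate(paragraphs, start=1):
        # Naive sentence split: break after ., !, or ? followed by whitespace.
        sentences = re.split(r"(?<=[.!?])\s+", paragraph.strip())
        for sent_num, sentence in enumerate(sentences, start=1):
            for word_num, word in enumerate(sentence.split(), start=1):
                rows.append({
                    "word": word,
                    "section": section,
                    "para_num": para_num,
                    "sent_num": sent_num,
                    "word_num": word_num,
                })
    return rows


sample = (
    "Qualified immunity frequently involves alleged 1st Amendment violations. "
    "These violations are often alleged against non-police, non-prison defendants."
)
rows = index_words(sample)
```

A naive splitter like this one would break on abbreviations such as "U.S.C.", which is one reason to prefer a trained sentence tokenizer on real opinions.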
This indexing allowed us to group the text in different ways. For example, in certain cases, it was helpful to keep each word stored separately. In others, it was preferable for each row in the table to list an entire sentence. Using the indexing, we could easily change the table above into groups of sentences, paragraphs, or even entire sections. Table A2 shows how the data from Table A1 could be grouped into sentences instead of broken out by word.
Table A2: Example of Indexed Data Grouped by Sentence
Sentence | Section | Para Num | Sent Num |
---|---|---|---|
Qualified immunity frequently involves alleged 1st Amendment violations. | Opinion | 1 | 1 |
These violations are often alleged against non-police, non-prison defendants. | Opinion | 1 | 2 |
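The regrouping from Table A1 to Table A2 can be sketched as follows. This is an illustrative example only: it assumes the word-level rows are dictionaries already in document order, and the names `group_by_sentence` and `word_rows` are hypothetical.

```python
from itertools import groupby


def group_by_sentence(rows):
    """Collapse word-level rows into one row per sentence.

    Assumes rows are already in document order, because groupby only
    merges consecutive rows that share the same key.
    """
    def sentence_key(row):
        return (row["section"], row["para_num"], row["sent_num"])

    grouped = []
    for (section, para_num, sent_num), words in groupby(rows, key=sentence_key):
        grouped.append({
            "sentence": " ".join(w["word"] for w in words),
            "section": section,
            "para_num": para_num,
            "sent_num": sent_num,
        })
    return grouped


word_rows = [
    {"word": "These", "section": "Opinion", "para_num": 1, "sent_num": 2},
    {"word": "violations", "section": "Opinion", "para_num": 1, "sent_num": 2},
    {"word": "recur.", "section": "Opinion", "para_num": 1, "sent_num": 2},
    {"word": "Courts", "section": "Opinion", "para_num": 1, "sent_num": 3},
    {"word": "disagree.", "section": "Opinion", "para_num": 1, "sent_num": 3},
]
sentence_rows = group_by_sentence(word_rows)
```

Changing the key to `(section, para_num)` or `(section,)` would group by paragraph or by section in the same way.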
Model-based approaches for prediction (described in more detail below) were more likely to rely on tables with each word listed separately, while rules-based approaches were more likely to rely on tables with sentences or entire paragraphs grouped together.
Annotating the Opinions
Aside from hierarchical indexing, we added several annotations to each word in a table:
- The word with all capitalization and punctuation removed.
- The part of speech. We added these annotations using the NLTK library’s part-of-speech tagger. 4
- A binary indicator for whether the word included (or was) a number. For example, if “1st” appeared in the text, we tagged the word as including a number.
- The word stem or root. For example, the stem of the word “drafting” is “draft.” We used the Porter word stemmer from the NLTK library to perform the stemming.
- A binary indicator for whether the word was a “stop” word—that is, a word that has little inherent meaning or performs a mostly grammatical function (e.g., “a,” “the,” “which,” etc.). We used the NLTK library’s built-in list of stop words supplemented with our own manually developed list of words that appear frequently in legal opinions. 5
The result was a table that included each word, its hierarchical index location, and all additional annotations. 6 For a brief sample of what this looked like, see Table A3, where blue represents the hierarchical index and green represents the additional annotations. 7
Table A3: Example of Indexed Data With Annotations
Word | Section | Para Num | Sent Num | Word Num | Term | Part of Speech | Number? | Word Stem | Stop Word? |
---|---|---|---|---|---|---|---|---|---|
Qualified | Opinion | 1 | 1 | 1 | qualified | adj. | 0 | qualif | 0 |
immunity | Opinion | 1 | 1 | 2 | immunity | noun | 0 | immun | 0 |
frequently | Opinion | 1 | 1 | 3 | frequently | adv. | 0 | frequent | 0 |
involves | Opinion | 1 | 1 | 4 | involves | verb | 0 | involv | 0 |
alleged | Opinion | 1 | 1 | 5 | alleged | adj. | 0 | alleg | 0 |
1st | Opinion | 1 | 1 | 6 | 1st | adj. | 1 | 1st | 0 |
Amendment | Opinion | 1 | 1 | 7 | amendment | noun | 0 | amend | 0 |
violations. | Opinion | 1 | 1 | 8 | violations | noun | 0 | violat | 0 |
These | Opinion | 1 | 2 | 1 | these | pronoun | 0 | these | 1 |
violations | Opinion | 1 | 2 | 2 | violations | noun | 0 | violat | 0 |
are | Opinion | 1 | 2 | 3 | are | verb | 0 | are | 1 |
often | Opinion | 1 | 2 | 4 | often | adv. | 0 | often | 0 |
alleged | Opinion | 1 | 2 | 5 | alleged | verb | 0 | alleg | 0 |
against | Opinion | 1 | 2 | 6 | against | prep. | 0 | against | 0 |
non-police, | Opinion | 1 | 2 | 7 | non-police | noun | 0 | non-police | 0 |
non-prison | Opinion | 1 | 2 | 8 | non-prison | noun | 0 | non-prison | 0 |
defendants. | Opinion | 1 | 2 | 9 | defendants | noun | 0 | defend | 1 |
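Three of the annotations in Table A3 can be reproduced with the standard library alone, as sketched below. The sketch makes several assumptions: punctuation is stripped only from the edges of a word (so that "non-police," becomes "non-police," as in the table), the stop-word list is a tiny hypothetical stand-in for the combined NLTK and legal lists, and the part-of-speech and stem columns are omitted because they depend on NLTK.

```python
import string

# Hypothetical, abbreviated stop-word list for illustration only.
EXAMPLE_STOP_WORDS = {"a", "the", "which", "these", "are"}


def annotate(word, stop_words=EXAMPLE_STOP_WORDS):
    """Return the normalized term plus number and stop-word indicators."""
    # Lowercase and strip punctuation from both ends; internal hyphens stay.
    term = word.strip(string.punctuation).lower()
    return {
        "word": word,
        "term": term,
        "has_number": int(any(ch.isdigit() for ch in word)),
        "is_stop_word": int(term in stop_words),
    }


annotated = [annotate(w) for w in ["These", "1st", "violations.", "non-police,"]]
```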
Ultimately, we did not use all the annotations in our predictive algorithms. For example, we generally found that word stems hurt performance during the development process, so we avoided including them in our final algorithms. Nevertheless, having these annotations available allowed for more flexibility in the algorithm-development process.
Building the Predictive Algorithms
Once we had turned the opinions into data and indexed and annotated them, we could build our predictive algorithms. We randomly sampled roughly 11% of our 7,173-opinion dataset for model development and evaluation, using stratified sampling to ensure each year was proportionally represented. 8 We hand coded these 791 opinions across 34 fields and assigned them to one of two samples using random stratified sampling by year:
- Training sample: This consisted of 604 opinions we used to develop the predictive algorithms. We further divided these opinions into two sub-groups, again using random stratified sampling by year:
- Primary training sample: We used 529 opinions to build the predictive algorithms.
- Validation sample: We used 75 opinions to assess performance during the development process and fine-tune the algorithms.
- Testing sample: This consisted of 187 opinions we used solely for the final evaluation of the algorithms’ performance. Because we used these only once the algorithms were final, these holdout test data were not part of the development process.
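Random stratified sampling by year can be sketched as below. The function name `stratified_split`, the record layout, and the fixed seed are illustrative assumptions. Because each year is rounded separately, a routine like this will not necessarily reproduce the exact sample sizes reported above.

```python
import random
from collections import defaultdict


def stratified_split(records, key, fraction, seed=0):
    """Randomly select about `fraction` of the records within each stratum.

    Returns (selected, remainder). `key` maps a record to its stratum,
    for example the year the opinion was issued.
    """
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for record in records:
        by_stratum[key(record)].append(record)

    selected, remainder = [], []
    for stratum in sorted(by_stratum):
        members = by_stratum[stratum][:]
        rng.shuffle(members)
        n_selected = round(len(members) * fraction)
        selected.extend(members[:n_selected])
        remainder.extend(members[n_selected:])
    return selected, remainder


# Toy data: 100 opinions spread evenly over four years.
opinions = [{"id": i, "year": 2010 + i % 4} for i in range(100)]
test_sample, training_sample = stratified_split(
    opinions, key=lambda r: r["year"], fraction=0.2
)
```

Applying the same function again to the training sample would yield the primary training and validation sub-groups.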
We used the primary training sample to build our predictive algorithms, constructing the algorithms for each field separately. To do so, we used four different approaches:
Rules-based extraction: We extracted, rather than predicted, seven of the 34 fields. This method of obtaining information began with narrowing the search area. For example, most basic qualitative information was present at the top of an opinion, so we searched only headers for these fields.
Next, we used “regular expressions,” a special programming tool that enables powerful and focused text searches, to find and extract the desired information. 9
For example, one of our simplest rules-based extraction algorithms targeted the circuit court that issued the opinion. To find this information, we searched an opinion’s header for a paragraph that started with “United States Court.” Using regular expressions, we then extracted the text that came immediately after that phrase in the paragraph (e.g., “United States Court, Sixth Circuit”). If no further text was present in the paragraph, we took the first words of the line immediately after, as those typically referred to the circuit court in question.
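A minimal sketch of this kind of extraction rule follows. It is not the rule used in the project: the optional "of Appeals" wording, the function name `extract_circuit`, and the choice to return the whole following line (rather than only its first words) are simplifying assumptions.

```python
import re

# Matches a header paragraph that starts with "United States Court" and
# captures whatever follows on the same line (possibly nothing).
COURT_LINE = re.compile(r"^United States Court\b(?: of Appeals)?[,\s]*(.*)$")


def extract_circuit(header_paragraphs):
    """Return the circuit named in the header, or None if no rule fires."""
    for i, paragraph in enumerate(header_paragraphs):
        match = COURT_LINE.match(paragraph.strip())
        if not match:
            continue
        remainder = match.group(1).strip(" .")
        if remainder:
            return remainder
        # Nothing after the phrase: fall back to the next line, if any.
        if i + 1 < len(header_paragraphs):
            return header_paragraphs[i + 1].strip(" .")
    return None


same_line = extract_circuit(["No. 00-0000", "United States Court of Appeals, Sixth Circuit."])
next_line = extract_circuit(["United States Court of Appeals,", "Ninth Circuit."])
```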
Rules-based prediction: In addition to using rules to extract information, we used them to predict 11 of the 34 fields. These rules most often took the form of “if . . . then” statements: If certain phrases were found in a particular part of an opinion, then we would classify the opinion a certain way. For example, one of our most effective rules for predicting whether an opinion involved qualified immunity was a search for the court’s recitation of the two-pronged standard of review—that is, something like “(1) . . . constitutional violation . . . (2) . . . clearly established.” 10 If this text pattern appeared, we classified the opinion as involving qualified immunity.
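An "if . . . then" rule of this kind might look like the sketch below. The exact pattern, the 300-character windows between elements, and the function name are assumptions made for illustration.

```python
import re

# "(1) ... constitutional violation ... (2) ... clearly established",
# allowing up to 300 characters between each element (an assumed window).
TWO_PRONG = re.compile(
    r"\(1\).{0,300}?constitutional\s+violation"
    r".{0,300}?\(2\).{0,300}?clearly\s+established",
    re.IGNORECASE | re.DOTALL,
)


def recites_two_prong_standard(text):
    """If the two-pronged standard appears, then classify as qualified immunity."""
    return TWO_PRONG.search(text) is not None


recital = (
    "We ask (1) whether the facts alleged show a constitutional violation "
    "and (2) whether the right was clearly established at the time."
)
is_qualified_immunity = recites_two_prong_standard(recital)
```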
Rules-based approaches relied heavily on hierarchical indexing (and especially the section indexing) to narrow the search area, as specific pieces of information often showed up in certain parts of an opinion document. For example, defendants frequently appeared in the first paragraph of the main section of an opinion. To predict whether an opinion dealt with state law enforcement defendants, we therefore searched sentences in the opinion’s introduction for phrases such as “against . . . police.”
We again used regular expressions to create flexible searches. Sometimes, this flexibility meant searching for many synonyms or variations. In the state law enforcement defendant example above (“against . . . police”), we searched for several similar law enforcement officials in addition to police: “highway patrol officers,” “state troopers,” “sheriffs,” and so forth. We also used regular expressions to avoid false positives. Continuing with the same example, if the word “police” was preceded by “capitol,” we did not count it as a match since Capitol Police are federal, not state, officials.
Given the variety of language used by courts, individual rules like these rarely found all or even most opinions with a characteristic we sought. Instead, we often strung separate rules together, with each rule tuned to avoid false positives. For example, our algorithm for predicting whether an appeal dealt with state law enforcement defendants used eight separate rules, all tuned to avoid false positives.
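The ideas in the preceding paragraphs (synonym lists, an exclusion for Capitol Police, and several rules strung together) can be sketched as follows. The patterns, the 80-character window, and the second rule are hypothetical examples, not the eight rules used in the project.

```python
import re

# Rule 1: "against ... <state law enforcement term>", skipping matches where
# "police" is immediately preceded by "capitol" (federal, not state).
AGAINST_STATE_OFFICERS = re.compile(
    r"\bagainst\b.{0,80}?(?<!capitol\s)"
    r"\b(police|state\s+troopers?|sheriffs?|highway\s+patrol\s+officers?)\b",
    re.IGNORECASE,
)

# Rule 2 (hypothetical): a plaintiff who "sued" a deputy sheriff.
SUED_DEPUTY = re.compile(r"\bsued\b.{0,80}?\bdeputy\s+sheriffs?\b", re.IGNORECASE)

RULES = [AGAINST_STATE_OFFICERS, SUED_DEPUTY]


def involves_state_law_enforcement(sentence, rules=RULES):
    """Classify as positive if any one of the narrow rules matches."""
    return any(rule.search(sentence) for rule in rules)


city = involves_state_law_enforcement(
    "Plaintiff brought this action against two city police officers."
)
capitol = involves_state_law_enforcement(
    "Plaintiff sought damages against Capitol Police officers."
)
```

Because each rule is written narrowly, combining them with a logical "or" raises the share of relevant opinions found while keeping false positives low.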
Model-based prediction: Beyond rules, we used statistical models to predict three of our fields.
To build these models, we used algorithms such as naïve Bayes, penalized logistic regression (generally, ridge logistic regression), and support vector machines. (We used different models for different fields.) Specifically, we fed into the models small collections of highly relevant words and phrases identified using feedback from our experienced qualified immunity attorneys and data exploration methods. 11 For example, our model for predicting whether an opinion dealt with excessive force violations included only five highly specific inputs: three related to common “use of force” phrases such as “excessive force” and “deadly force,” one related to the standard of review for prison excessive force appeals, and one related to overly tight handcuffs.
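To make the idea of ridge (L2-penalized) logistic regression on a handful of inputs concrete, the sketch below fits such a model with plain gradient descent. It is a teaching example only: the text does not say which software fit the models, and the single feature, the toy data, the penalty strength, and the learning rate are all assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, intercept, features):
    """Predicted probability that the field is present."""
    return sigmoid(intercept + sum(w * x for w, x in zip(weights, features)))


def fit_ridge_logistic(X, y, penalty=0.1, learning_rate=0.5, epochs=2000):
    """Minimize mean log loss + (penalty / 2) * ||weights||^2 by gradient descent.

    The intercept is left unpenalized, as is conventional.
    """
    n, d = len(X), len(X[0])
    weights, intercept = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for features, label in zip(X, y):
            error = predict_proba(weights, intercept, features) - label
            for j in range(d):
                grad_w[j] += error * features[j]
            grad_b += error
        for j in range(d):
            weights[j] -= learning_rate * (grad_w[j] / n + penalty * weights[j])
        intercept -= learning_rate * grad_b / n
    return weights, intercept


# Toy data: one scaled phrase-frequency feature per opinion (hypothetical).
X = [[0.0], [0.2], [0.1], [2.0], [2.5], [3.0]]
y = [0, 0, 0, 1, 1, 1]
weights, intercept = fit_ridge_logistic(X, y)
```

The penalty shrinks the weights toward zero, which helps keep a model with few training examples from overreacting to any single phrase.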
Since words and phrases will usually appear more often in longer opinions than in shorter ones, we normalized the frequency counts by dividing by the number of words in the opinion. 12 For example, if a word appeared three times in a 500-word opinion, we would divide the raw frequency (3) by the number of words (500).
In addition to dividing by the word count, we divided by the word’s standard deviation in our primary training sample. 13 This step was necessary given how we compiled our model inputs. 14
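The two scaling steps can be written out as a short worked example. It assumes the standard deviation is computed over the length-normalized frequencies in the training sample and that the population formula is used. Neither detail is specified above, and the training values are invented for illustration.

```python
import statistics


def relative_frequency(count, n_words):
    """Raw count divided by the number of words in the opinion."""
    return count / n_words


def scale_by_training_sd(rate, training_rates):
    """Divide a relative frequency by its standard deviation in the training sample."""
    sd = statistics.pstdev(training_rates)  # population SD: an assumption
    if sd == 0:
        raise ValueError("feature is constant in the training sample")
    return rate / sd


# The example from the text: a word appearing 3 times in a 500-word opinion.
rate = relative_frequency(3, 500)

# Hypothetical relative frequencies of the same word in four training opinions.
training_rates = [0.0, 0.002, 0.004, 0.006]
scaled = scale_by_training_sd(rate, training_rates)
```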
To achieve satisfactory performance, we tuned our models using five-fold cross-validation. 15 We used cross-validation to pick the model type, engineer features (i.e., choose model inputs), and calibrate various “knobs and dials” on certain models. 16 We also conducted internal assessments of performance using the validation sample as an additional assurance that the models worked well on unseen opinions. Although we did use naïve Bayes and support vector machine models for certain fields, we generally found, through cross-validation, that ridge logistic regression models performed best. 17
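The mechanics of five-fold cross-validation can be sketched as below: the training opinions are divided into five folds, and each fold takes one turn as the held-out set while the other four are used for fitting. The function name is hypothetical, and the sketch assumes the opinions have already been shuffled.

```python
def k_fold_indices(n_items, k=5):
    """Yield (fit_indices, held_out_indices) for each of k folds.

    Fold sizes differ by at most one item. Items are assumed to be in
    random order already.
    """
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    indices = list(range(n_items))
    start = 0
    for size in fold_sizes:
        held_out = indices[start:start + size]
        fit = indices[:start] + indices[start + size:]
        yield fit, held_out
        start += size


# The primary training sample contained 529 opinions.
folds = list(k_fold_indices(529, k=5))
held_out_sizes = [len(held_out) for _, held_out in folds]
```

Averaging a performance metric over the five held-out folds gives a single score for comparing model types, inputs, and tuning settings.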
Rules-model hybrids: For the remaining 13 of the 34 fields, we used a hybrid method involving both rules and models. Typically, we integrated the two approaches by tuning both the rules and the models to avoid false positives. If the model predicted that a field was present, that became the final prediction. For example, if the model predicted the opinion dealt with a First Amendment violation, that is how the algorithm classified it.
But if the model predicted that the field was not present—that the opinion did not deal with a First Amendment violation—then the algorithm turned to a rules-based method to classify the opinion.
This approach proved effective because the rules and models tended to target slightly different types of opinions: Models worked well when distinctive vocabularies were present throughout an opinion, while rules worked well when highly specific phrases were present.
For example, the model that predicted whether an opinion involved qualified immunity was excellent when an opinion extensively discussed qualified immunity and clearly established rights. But not all relevant opinions featured substantial discussion of qualified immunity. In some, the court focused on other issues but then granted qualified immunity in the alternative in a footnote using a highly specific language pattern such as “in the alternative . . . we hold defendants are entitled to qualified immunity.” This is not enough text for a model to make reliable predictions, but we could write a rule to find patterns like it.
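The hybrid logic described above reduces to a short decision procedure, sketched here. The stand-in "model" (a simple phrase count), the rule pattern, and the function names are hypothetical placeholders for the trained models and tuned rules.

```python
import re

# Hypothetical rule for the footnote pattern quoted above.
ALTERNATIVE_GRANT = re.compile(
    r"in the alternative.{0,200}?entitled to qualified immunity",
    re.IGNORECASE | re.DOTALL,
)


def stand_in_model(text):
    """Placeholder for a trained model: fires only on extensive discussion."""
    return text.lower().count("qualified immunity") >= 3


def hybrid_predict(text, model, rules):
    """Model first; if the model predicts 'not present', fall back to the rules."""
    if model(text):
        return True
    return any(rule.search(text) for rule in rules)


footnote_only = (
    "The claim fails on the merits. In the alternative, we hold defendants "
    "are entitled to qualified immunity."
)
prediction = hybrid_predict(footnote_only, stand_in_model, [ALTERNATIVE_GRANT])
```

Here the stand-in model misses the opinion because the phrase appears only once, and the rule catches it.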
Running the Predictive Algorithms
Once the algorithms were complete, we put them into a broader Python function that created the dataset. This function first read in the opinion, organizing and indexing it as described above. Next, it sequentially predicted the 34 fields. The function repeated this process for each opinion in the file directory, resulting in a final dataset that included predictions for every opinion and field.
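The overall loop can be sketched as follows. This illustration assumes the opinions have already been converted to plain-text files and represents each field's algorithm as a function from text to a prediction. The field names, file pattern, and predictors are hypothetical.

```python
import tempfile
from pathlib import Path


def build_dataset(directory, predictors, pattern="*.txt"):
    """Run every field predictor on every opinion file in a directory."""
    dataset = []
    for path in sorted(Path(directory).glob(pattern)):
        text = path.read_text(encoding="utf-8")
        row = {"file": path.name}
        # Predict each field in turn for this opinion.
        for field, predict in predictors.items():
            row[field] = predict(text)
        dataset.append(row)
    return dataset


# Two toy predictors standing in for the 34 field algorithms.
predictors = {
    "qualified_immunity": lambda text: "qualified immunity" in text.lower(),
    "excessive_force": lambda text: "excessive force" in text.lower(),
}

with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text("Qualified immunity bars the claim.", encoding="utf-8")
    Path(tmp, "b.txt").write_text("The officer used excessive force.", encoding="utf-8")
    dataset = build_dataset(tmp, predictors)
```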
Once the algorithms were complete, we also formally evaluated their effectiveness using the 187-opinion holdout test sample. Comprehensive statistics on how each field performed on this holdout test set are available in Appendix B.
Finally, after we assessed the algorithms’ performance, we performed minor manual cleanup on our final dataset. 18 Readers interested in the programming code we used to generate the final dataset can contact us here. 19