The Data We Acquired From Using LLMs to Support Thematic Analysis

This paper is available on arXiv under a CC 4.0 license.

Authors:

(1) Jakub DRÁPAL, Institute of State and Law of the Czech Academy of Sciences, Czechia, Institute of Criminal Law and Criminology, Leiden University, the Netherlands;

(2) Hannes WESTERMANN, Cyberjustice Laboratory, Université de Montréal, Canada;

(3) Jaromir SAVELKA, School of Computer Science, Carnegie Mellon University, USA.

3. Dataset

In our experiments, we used a dataset of 785 facts descriptions from cases decided by Czech courts in 2017. From the Prosecution Service, we received 834 cases in which an adult defendant was found guilty of theft. In Czechia, theft also includes burglary and pickpocketing.[2] We slightly over-represented the most serious offenses to ensure a sufficient number of such cases in the dataset.

We removed 49 cases from the dataset because they had been used in a pilot study or contained errors, leaving 785 cases. From each remaining case, we extracted the text describing the facts. Each extracted text was anonymized and, where necessary, shortened or partially rewritten.

The resulting text snippets range from 73 to 29,695 characters in length (1Q 447, median 782, 3Q 1,462 characters). Figure 2 shows an example (automated translation).
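Length statistics of this kind can be reproduced with a few lines of Python. The following is a minimal sketch, not the authors' code: the function name `length_summary`, the toy snippets, and the choice of the "inclusive" quartile method are assumptions made for illustration, since the paper does not state how the quartiles were computed.

```python
import statistics


def length_summary(snippets):
    """Return min, Q1, median, Q3 and max of snippet lengths in characters.

    Assumes at least two snippets; the quartile method is an assumption.
    """
    lengths = sorted(len(s) for s in snippets)
    # method="inclusive" treats the data as the whole population of interest
    q1, median, q3 = statistics.quantiles(lengths, n=4, method="inclusive")
    return {
        "min": lengths[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": lengths[-1],
    }


# Toy stand-ins for facts descriptions (hypothetical data, not from the dataset)
toy_snippets = ["a" * n for n in (10, 20, 30, 40, 50)]
summary = length_summary(toy_snippets)
print(summary)
```

Other quartile conventions (for example, the "exclusive" method) can give slightly different values on small samples, so results from such a sketch need not match the reported figures exactly.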

A group of three law students, supervised by one of the authors of this paper, manually conducted an unstructured variant of thematic analysis.[3] The group arrived at 14 high-level themes focused on the modus operandi and the target of the thefts (Figure 2).

For each facts description, two students independently chose a single theme according to specified rules.

Disagreements were resolved by one of the students after a careful re-reading of the case. The distribution of the themes over the 785 facts descriptions included in the dataset is presented in Figure 2. Theft in a shop (29.0%) and breaking into another object (17.5%) are the most prevalent themes.
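The double-annotation workflow described above (two independent labels per facts description, disagreements resolved afterwards, then a distribution of final themes) can be sketched as follows. This is an illustrative sketch only: the function names, the toy labels, and the `resolved` mapping are hypothetical and do not come from the authors' materials.

```python
from collections import Counter


def find_disagreements(labels_a, labels_b):
    """Return indices of items on which the two annotators chose different themes."""
    return [i for i, (a, b) in enumerate(zip(labels_a, labels_b)) if a != b]


def resolve(labels_a, labels_b, resolved):
    """Build final labels; `resolved` maps a disagreement index to the chosen theme."""
    return [a if a == b else resolved[i]
            for i, (a, b) in enumerate(zip(labels_a, labels_b))]


def theme_shares(final_labels):
    """Return each theme's share of all items, in percent, rounded to one decimal."""
    counts = Counter(final_labels)
    total = len(final_labels)
    return {theme: round(100 * n / total, 1) for theme, n in counts.most_common()}


# Toy labels (hypothetical placeholders, not the dataset's annotations)
labels_a = ["shop", "shop", "vehicle", "shop"]
labels_b = ["shop", "pickpocketing", "vehicle", "shop"]

to_review = find_disagreements(labels_a, labels_b)
# A human re-reads each disputed case; here the outcome is simply assumed
resolved = {1: "pickpocketing"}
final = resolve(labels_a, labels_b, resolved)
shares = theme_shares(final)
print(to_review, shares)
```

Keeping the resolution step as an explicit mapping mirrors the fact that it is a human judgment made after re-reading the case, not something derived automatically from the two labels.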

[2] ICCS codes 0501 and 0502 except for 0502212 [9].

[3] We did not rigorously adhere to the process described in [5].
