Melvin Tjon Akon

Category: Dutch Law

QNL #4: Dutch Case Law Brief, November 16 – December 30, 2015

Posted on 01/03/2016 - 01/04/2016 by mtjonakon

As some time has passed since I posted the last QNL, in this QNL I will analyze the last six weeks of Dutch jurisprudence, from November 16 to December 30.

Besides the weekly statistics, I will zoom in on how useful the website’s categorization of cases (uitspraken.rechtspraak.nl) is for corporate finance lawyers. In particular, I use a Latent Dirichlet Allocation (LDA) model to explore how this categorization could be improved.

Cases by institution

The total number of cases published in the last six weeks is 2389. As expected (see previous posts), most published decisions were issued by the Raad van State (RVS) and the Centrale Raad van Beroep (CRVB):

casesbyinstitution

A closer look at these twin peaks shows that the majority of RVS decisions are in the area of administrative law (80.3%, 347 of 432). This makes sense, as the General Administrative Law Act (Algemene wet bestuursrecht) confers jurisdiction on the RVS in a wide range of administrative law areas. However, the category is broad and therefore not very informative. The majority of CRVB decisions are in the area of social security law (88.8%, 357 of 402). Again, this is a very broad category. The Supreme Court published 178 cases in this period, but approximately 25 of those are article 81 RO decisions (see this post for an explanation of this type of decision).

Cases by area of law

The top 5 areas of law in terms of volume have not changed either. Administrative law, civil law and penal law comprise the top 3, followed by social security law and tax law. Since nearly all of the remaining areas of law can be subsumed under the three front-runners, the tail of the frequency distribution is not very interesting.

casesbyareaoflaw

Throughput time

Plotting the elapsed time between the date of the verdict and the date of publication for each case reveals that courts generally publish their decisions on the day of the verdict. There are a number of outliers at the far end. In three instances, more than 26 days elapsed between the verdict and the publication of a decision of the Amsterdam Court of Appeals. Other institutions with decisions at that far end of the distribution are a number of civil courts of first instance and the Arnhem-Leeuwarden Court of Appeals. Note that this is not a critique, especially since the underlying reasons for the delay are not known.

throughput time

Topic models: towards an improvement of the case categorization interface 

Instead of a plain vanilla n-gram frequency analysis, I focus on another useful technique for categorization of court cases: topic modelling using Latent Dirichlet Allocation (LDA).

The case for improving the website’s categorization of court cases is compelling. Try using the website’s advanced search interface (link). The official categories may be useful for doctrinal purposes, but for the corporate finance practitioner they are impractical. Whether the practice is focused on litigation or on transactions, this practitioner is confronted with a framework of rules that does not map onto categorizations such as European versus national origin, private law versus public law, or private law versus penal law. Instead, the day-to-day legal analysis in the finance practice revolves around particular legal acts and concepts: offering securities, providing security and director liability, to name a few. The legal rules applicable to these acts and concepts stem from various doctrinal categories, but those categories matter less for practical purposes. To promote information efficiency, it makes sense to tailor the official categorizations to the needs of legal practice.

Lawyers with a data science background could play an important role here.

For a thorough explanation of topic modeling using LDA, I recommend reading David Blei’s work. Generally, topic modeling starts from the assumption that documents exhibit multiple topics. More specifically:

  • each topic is a distribution over words;
  • each document is a mixture of corpus-wide topics; and
  • each word in the document is drawn from one of those topics.
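
The three assumptions above can be read as a recipe for generating a document. The following is a minimal sketch of that recipe in plain Python, not the code used for this post; the toy topics, the word probabilities and the function name `generate_document` are hypothetical and serve only as illustration.

```python
import random

def generate_document(topics, doc_topic_mix, n_words, rng):
    """Draw the words of one document from a mixture of topics."""
    topic_ids = list(range(len(topics)))
    words = []
    for _ in range(n_words):
        # Pick the topic of this word from the document's topic mixture.
        k = rng.choices(topic_ids, weights=doc_topic_mix)[0]
        vocab = list(topics[k])
        # Pick the word itself from that topic's distribution over words.
        word = rng.choices(vocab, weights=[topics[k][w] for w in vocab])[0]
        words.append(word)
    return words

# Hypothetical toy topics: each topic is a distribution over words.
topics = [
    {"bank": 0.5, "lening": 0.3, "rente": 0.2},
    {"werknemer": 0.6, "ontslag": 0.4},
]
# This document is 70% topic 0 and 30% topic 1.
doc = generate_document(topics, [0.7, 0.3], 10, random.Random(0))
print(doc)
```

Topic modeling runs this recipe in reverse: only the words are observed, and the topics and mixtures have to be inferred.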

The goal is to infer the hidden variables (the topic structure, i.e. the topics, the proportions of topics per document and the topic assignment of each word) from the documents, which consist of observed variables (words). If the previous assumptions hold, LDA describes the generative process by which the joint probability distribution of the hidden and observed variables comes about. To compute the topic distribution, a Bayesian statistical model is used, and Gibbs sampling approximates the posterior distribution of that model given the observed data.
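
To make the sampling step concrete, here is a minimal sketch of a collapsed Gibbs sampler for LDA in plain Python. It is an illustration under stated assumptions rather than the implementation behind this post: the function name `gibbs_lda`, the symmetric hyperparameters `alpha` and `beta`, the iteration count and the toy documents are all hypothetical choices.

```python
import random
from collections import defaultdict

def gibbs_lda(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on a list of tokenized documents."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * n_topics for _ in docs]                # words per topic, per document
    topic_word = [defaultdict(int) for _ in range(n_topics)]  # count of each word, per topic
    topic_total = [0] * n_topics                              # total words per topic
    assign = []                                               # topic of every word token

    # Start from a random topic assignment for every word token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(z_doc)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                k = assign[d][n]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Weight of each topic given all other assignments.
                weights = [
                    (doc_topic[d][j] + alpha)
                    * (topic_word[j][w] + beta)
                    / (topic_total[j] + vocab_size * beta)
                    for j in range(n_topics)
                ]
                # Resample the token's topic and restore the counts.
                k = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][n] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return doc_topic, topic_word

# Hypothetical toy corpus, already tokenized.
docs = [
    "bank lening rente bank".split(),
    "werknemer ontslag kantonrechter werknemer".split(),
    "bank rente lening rente".split(),
]
doc_topic, topic_word = gibbs_lda(docs, n_topics=2)
for k, counts in enumerate(topic_word):
    top = sorted(counts, key=counts.get, reverse=True)[:3]
    print(f"Topic {k}:", " ".join(top))
```

The most frequent words per topic in `topic_word` play the role of the topic word lists, and `doc_topic` gives the topic proportions per document after normalization. On real case law a tested library implementation would be the safer choice.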

A topic model for finance-relevant case law

Fitting an LDA topic model to a subset of the case law generates a list of candidate finance topics. To limit the list of possible topics, I restrict the set of cases to those with references to the Civil Code (Burgerlijk Wetboek) or the Financial Supervision Act (Wet op het financieel toezicht), 476 cases in total. The following topics are estimated:

Topic 0: verweerder werknemer verzoek kantonrechter arbeidsovereenkomst verzoeker verzoekster

Topic 1: appellant hof geintimeerde beroep geintimeerd hoger grief

Topic 2: moeder minderjar vader de hof minderjarige man

Topic 3: man vrouw hof rechtbank per partij bedrag

Topic 4: of verdacht slachtoffer en rechtbank ten medeverdacht

Topic 5: eiser verweerder percel rechtbank artikel gemachtigd verzoek

Topic 6: the cramm of to and interpolis har

Topic 7: deskund rapport achmea schad har rechtbank of

Topic 8: artikel besluit beroep uitsprak lid die niet

Topic 9: eiser naam har tzl bestuurder gedaagd eo

Topic 10: naam afm niet accountantskamer dsb colleg verwijt

Topic 11: gedaagd eiser eiseres gedaagde vorder sub vonnis

Topic 12: bedrijf en of stichting medeverdacht obligatiehouder bestuurder

Topic 13: gemeent appellante huurovereenkomst partij woning huurder overeenkomst

Topic 14: niet die als om ook dit dez

Topic 15: bedrag overeenkomst betal rechtbank niet dez artikel

Topic 16: art vorder zak shell milieudefensie getuig die

Topic 17: gedaagde artikel propertiz dahabshiil nam websit bkr

Topic 18: of en belegger bedrag verdacht daaromtrent onder

Topic 19: bank beher abn vorder amro sns rabobank

Some of the estimated topics contain a potentially interesting set of topic words. Topics 10, 12, 15, 18 and 19 seem especially relevant to the corporate finance practitioner. However, the small number of documents in the subset and the short time period of six weeks may bias the list of topic words. That being said, with some refinements, LDA topic models could improve the informational efficiency of the website’s categorization.
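
The narrowing step described in this section, keeping only cases that reference the Civil Code or the Financial Supervision Act, could be sketched as a simple text filter. This is an assumption about how such a filter might look, not the code used here; the case identifiers and texts are made up, and a real filter would probably also need to catch abbreviations and structured statute references.

```python
STATUTES = ("burgerlijk wetboek", "wet op het financieel toezicht")

def mentions_statute(text, statutes=STATUTES):
    """Return True if the decision text names any of the given statutes."""
    lowered = text.lower()
    return any(name in lowered for name in statutes)

# Hypothetical decision texts keyed by a made-up identifier.
cases = {
    "case-a": "Op grond van artikel 6:162 Burgerlijk Wetboek is gedaagde aansprakelijk.",
    "case-b": "De verdachte heeft artikel 310 Wetboek van Strafrecht overtreden.",
    "case-c": "De AFM stelt een overtreding van de Wet op het financieel toezicht vast.",
}
subset = {cid: text for cid, text in cases.items() if mentions_statute(text)}
print(sorted(subset))
```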

Posted in Case Law Analysis, Data Science, Dutch Law, Judicial Behavior
Tagged judicial behavior, latent dirichlet allocation, natural language processing, python, text analytics, topic models

QNL #2: Dutch Case Law Brief, October 26 – November 1, 2015

Posted on 11/09/2015 - 11/09/2015 by mtjonakon

In this QNL, I assess rulings issued by Dutch legal institutions published on www.rechtspraak.nl during the last week of October 2015 using Python’s natural language processing, data analysis and machine learning packages.

Number of cases by institution 

Between October 26 and November 1, 2015, a total of 262 new court rulings were published (list), according to a request via the site’s API. The graph below shows the number of rulings per issuing institution.

METRIC 1 - TOTAL INSTITUTIONS

The Council of State (RVS, Raad van State) issued the highest number of rulings (48), followed by the Supreme Court (39) and the Appellate Court of ‘s-Hertogenbosch (33). Summing by institutional level, a predictable pattern emerges. The courts of first instance issued the greatest number of decisions, followed by the appellate courts. The relative shares are shown below.

METRIC 4 - DECISIONS BY LEVEL
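
Counts per institution such as those above can be derived from the ECLI identifiers alone, because the third colon-separated part of an ECLI is the court code. The snippet below is a minimal sketch of that idea, not the code behind the graphs; only the first identifier appears elsewhere on this blog, and the other three are made-up placeholders.

```python
from collections import Counter

def court_code(ecli):
    """Return the court code, the third colon-separated part of an ECLI."""
    parts = ecli.split(":")
    if len(parts) != 5 or parts[0] != "ECLI":
        raise ValueError(f"not an ECLI identifier: {ecli!r}")
    return parts[2]

# The first identifier is a real one; the rest are hypothetical placeholders.
eclis = [
    "ECLI:NL:HR:2015:3091",
    "ECLI:NL:RVS:2015:0001",
    "ECLI:NL:RVS:2015:0002",
    "ECLI:NL:GHSHE:2015:0003",
]
per_court = Counter(court_code(e) for e in eclis)
for code, n in per_court.most_common():
    print(code, n)
```

Grouping the court codes by institutional level would need an additional mapping from code to level, which is left out here.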

Although the Attorney General (‘procureur generaal’, art. 111 Wet op de rechterlijke organisatie) technically is not an independent judicial body and only issues advisory opinions, I include those opinions in the analysis. The advisory opinions of the Attorney General are authoritative in Dutch legal circles, as they often provide guidance for judges at the trial and appellate levels.

It should be noted that most rulings issued by the Supreme Court (22 of 39) are “article 81 RO” rulings. In such a ruling, the Supreme Court decides that (1) the appellant’s complaints are not sufficient to warrant reversal of the appellate court’s decision, and (2) publication of the Supreme Court’s decision is not required for the sake of legal uniformity or development of the law. In addition, in a smaller number of rulings (4 of 39), the Supreme Court dismissed the complaint for inadmissibility.

Number of cases by area of law

This week, most rulings involved criminal law (61), followed by civil law (52) and administrative law (46). As discussed in a prior post, most cases tend to fall in these general areas of law, rather than more narrowly defined areas such as employment law (5) and business law (2). However, any 1L (first year law student) knows that it’s not the category that matters, but the legal principle discussed (see more below). A graph with the legal categories and frequency counts for this week is posted below.

METRIC 2 - DECISIONS BY AREA

Throughput time

How much time elapses between the decision date and the publication date?

METRIC 3 - THROUGHTPUT

Most cases are published on the date of the actual decision (150 of 262 or 58%), but in 38% (100 of 262) of the cases it takes 1-3 days to publish the decision. If it took 1 day, it was most likely the Appellate Court of ‘s-Hertogenbosch (GHSHE, 22) or the Central Council of Appeal (CRVB, 13). A salient fact is that those courts handled a disproportionately large number of family law (GHSHE) and social security law (CRVB) cases. It’s merely salient: I did not compute any correlations between area of law and throughput time, which would be a poor measure for several reasons.
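
The throughput figures above boil down to subtracting two dates per case and grouping the result. Here is a minimal sketch of that calculation; the date pairs are hypothetical, and the function names and bucket labels are my own choices rather than fields of the rechtspraak.nl data.

```python
from collections import Counter
from datetime import date

def throughput_days(decided, published):
    """Days between the decision date and the publication date."""
    return (published - decided).days

def bucket(days):
    """Group a throughput time into the ranges discussed above."""
    if days == 0:
        return "same day"
    if 1 <= days <= 3:
        return "1-3 days"
    if days > 3:
        return "more than 3 days"
    return "invalid (published before decision)"

# Hypothetical (decision date, publication date) pairs, for illustration only.
records = [
    (date(2015, 10, 26), date(2015, 10, 26)),
    (date(2015, 10, 27), date(2015, 10, 28)),
    (date(2015, 10, 27), date(2015, 10, 30)),
    (date(2015, 10, 26), date(2015, 11, 1)),
]
counts = Counter(bucket(throughput_days(d, p)) for d, p in records)
for label, count in counts.items():
    print(f"{label}: {count} of {len(records)} ({count / len(records):.0%})")
```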

N-gram analysis

The analysis of n-grams is a fundamental element of any data analysis involving text. Roughly speaking, n-grams are sequences of consecutive words that occur in the text, consisting of one (unigram), two (bigram) or three (trigram) words. In an n-gram analysis, the documents of the corpus are converted to a matrix with a vector for each n-gram, which facilitates the application of quantitative clustering or classification techniques (in combination with NLP tools such as a lemmatizer). The matrix can contain interesting information, such as the Most Important N-Gram (MING). With the MING approach, textual differences and similarities become apparent. Prior to using MING to classify texts (future posts), let’s contrast the most frequent n-grams in all civil law (52) and tax law (23) cases (top 50 unigrams, bigrams and trigrams).

METRIC 5 Trigrams
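
As an illustration of the conversion from documents to an n-gram matrix described above, here is a minimal sketch in plain Python. It is not the code used for the tables; the two example sentences are invented, and real decisions would first go through preprocessing such as stemming or lemmatization.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def count_matrix(docs, n):
    """Build a document-by-n-gram count matrix from raw text documents."""
    counts = [Counter(ngrams(doc.lower().split(), n)) for doc in docs]
    vocab = sorted(set().union(*counts))
    # One row per document, one column (vector) per n-gram.
    matrix = [[c[gram] for gram in vocab] for c in counts]
    return vocab, matrix

# Hypothetical example sentences.
docs = [
    "beroep in cassatie bij de hoge raad",
    "het hof verwerpt het beroep in cassatie",
]
vocab, matrix = count_matrix(docs, 3)
for gram, column in zip(vocab, zip(*matrix)):
    print(gram, column)
```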

What does the trigram analysis reveal?

The main advantage of n-gram analysis is its simplicity. The trigram analysis shows that most tax cases are Supreme Court cases (“beroep in cassatie”, “de hog rad”) and refer more often to precedents (“ecli”), while civil law cases contain more statements concerning the parties (“appellant”, “geïntimeerde”). The unigram analysis shows that demonstrative pronouns (“die”), connectors (“of”), the appealing party (“appellant”), the appellate court (“hof”) and the lawyer (“advocaat”) occur frequently throughout the corpus.

The tables clearly illustrate the main drawbacks of independent n-gram analyses. The first issue concerns “non-informative” words. Generally, courts use some nouns (e.g. “plaintiff”, “Court”) and pronouns (“that”, “he”) more often than words that are “informative” because they are only used in a particular subset of cases. For example, the nouns “bankruptcy” or “agreement” are less frequently used in all texts, but may occur more often in a particular subset of the corpus. The second issue concerns synonyms and pronouns. A court may switch between the words “shareholder”, “he” and “company” while referring to a parent corporation in a corporate group. The third and most important drawback is the loss of context. Even in the highly unlikely case that courts use only informative words and are consistent in their use of words, n-gram analyses ignore the textual context in which these words appear. The analyst only knows whether the court used the word(s), not how they were used. Yet, one of the most important characteristics of legal rules is that they establish legal relations between legal concepts.

In upcoming posts, I will address these issues by adding new layers of legal analytics to the QNL. Feel free to post comments or send me a message!

Posted in Case Law Analysis, Data Science, Dutch Law, Geen categorie, Judicial Behavior, Legal Analytics
Tagged beautiful soup, judicial behavior, rechtspraak.nl, text analytics

QNL #1: Dutch Case Law Brief, October 12-18, 2015

Posted on 10/29/2015 - 01/04/2016 by mtjonakon

In this post, I want to share with you the results of a program involving natural language processing and machine learning tools. The aim of the program is to provide insights into Dutch case law developments on a weekly basis, using case law published on www.rechtspraak.nl. As the program is still in beta, feel free to comment or make suggestions. I use Python as the main language, with several packages for data analysis and visualization.

1. Number of cases by institution and area of law

I collected the ECLI numbers of all cases published between October 12 and 18 via the site’s API (more info) and downloaded the contents of the cases to a local hard drive. In total, 301 cases were published (list).

Number of decisions by institution

The top 4 institutions (in terms of total cases published) are the “Raad van State” (RVS, 98), the “Centrale Raad van Beroep” (CRVB, 34), and the Supreme Court of the Netherlands (“Hoge Raad”, HR, 24), tied with “Gerechtshof ‘s-Hertogenbosch” (GHSHE, 24). A comparison by institutional group shows the expected pattern (number of decisions by lower courts versus the HR):

Decisions by Institutional Group

Plotting the distribution of legal areas on the x-axis:

Table 2 - Number of decisions by area of law.

Clearly, the top areas of law are “Bestuursrecht”, “Strafrecht”, “Socialezekerheidsrecht” and “Civiel recht”. I used the classification by area of law as designated by the judiciary in the metadata of each case. In later reports I intend to develop an alternative, text-based classification.

2. Court-specific metrics: output and throughput

While it would be interesting to analyze the workload of individual courts, rechtspraak.nl does not publish the number of filings at each court. As the United States Courts website illustrates, providing those numbers is feasible. Without filing info, analyzing workload is impossible. Instead, I look at the throughput time: the time it takes the court to publish the decision after issuing the verdict. An interesting question, to consider in the future, is its relation to decision length and case complexity.

Throughput_Hist

In most cases, the decision is published on the same day. There does not seem to be a general relation between the throughput time and area of law (PUB_area) or court (PUB_instantie). More data is needed to unearth the explanation for this distribution.

3. Text features: basic unigram classification

In terms of structure, court decisions are simply long and unstructured text strings. There are numerous natural language processing techniques that facilitate statistical analysis of these text strings by extracting features, some of which I will use in the future (e.g. POS tag sequences, Hidden Markov Models, deep learning, ontologies). In this QNL, I restrict the analysis to “unigrams” or individual words. This bag-of-words (BOW) approach is simplistic, but suffices for exploratory purposes. After preprocessing (Feldman & Sanger 2007), I counted the word frequencies for ECLI:NL:HR:2015:3091, a decision issued by the Supreme Court, and plotted the result as a word cloud:

CLOUD
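
The word counts behind a cloud like this one can be approximated with a few lines of standard-library Python. This is a minimal sketch, not the preprocessing pipeline actually used; the stop-word set is a small hypothetical sample and the input sentence is invented.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list for Dutch.
STOPWORDS = {"de", "het", "een", "van", "en", "in", "dat", "die", "of"}

def word_frequencies(text, stopwords=STOPWORDS):
    """Lowercase the text, keep alphabetic tokens and count the non-stop-words."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(t for t in tokens if t not in stopwords)

freq = word_frequencies("De Hoge Raad verwerpt het beroep. Het beroep faalt.")
print(freq.most_common(3))
```

The resulting `Counter` maps each word to its frequency, which is the input a word cloud generator typically needs.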

The main drawback of the BOW approach is that it assumes a positive correlation between word frequency and relevance. In other words: the more often the decision mentions a particular word, the greater the likelihood that the word is indicative of the decision’s content. It also performs better if each word refers to an independent concept (e.g. “claimant” does not equal “victim”), as it disregards that different words may refer to the same legal concept and that different legal concepts may fall in the same general category. For this reason, building and using ontologies (knowledge systems) is crucial in legal text mining.

In the upcoming weeks, I will use text classification algorithms with a richer set of features to analyze the decisions. Let me know what you think!

Posted in Data Science, Dutch Law, Geen categorie, Judicial Behavior, Legal Analytics
Tagged economic analysis of law, judicial behavior, law enforcement, python, rechtspraak.nl, text analytics