Does Legal Analytics Really Need “Big Data” to Make Predictions?

If you’re at all interested in legal technology, you’ve probably grown tired of the recent influx of fear-mongering articles about “robot lawyers” that are going to put legal professionals out of a job. This sub-genre of legal tech reporting features a lot of buzzwords. There’s “machine learning”, “NLP (natural language processing)”, “AI (artificial intelligence)”, and “predictive analytics”, to name just a few. Regrettably, a lot of these articles discuss legal technologies only in very vague terms, or sometimes don’t bother with definitions at all. And it’s very difficult to have a nuanced conversation about legal tech when it seems like everyone is talking past each other and no one can agree on the basics. So I’d like to go back to the beginning and start with something foundational: what actually is legal analytics?

In an InfoWorld article from 10 years ago, Jeremy Kirk bemoaned the fact that the word “analytics” was used so broadly that nearly everyone seemed to have a different idea of what it meant. Kirk described how Gartner, an information technology research company, had put together an informal survey asking its users to define the term. The responses “were completely all over the place” and left Gartner with “more questions than answers”. In the decade since that article was published, the situation has improved — but not by much.

After seeing the results of that survey, Gartner put together a working definition of analytics that I think is still very useful today:

“Analytics leverage data in a particular functional process (or application) to enable context-specific insight that is actionable.”

It’s a little bit dense, but what I find most helpful about that definition is that it highlights two of the most important components of analytics: that it gives you information about data that exists in a particular context (like the law) and that it’s information you can use for something.

Here’s a very basic example of analytics in action. If you’re reading this article on Slaw, you’ll see a section called “The Count” at the top of the right-hand menu. As I’m writing this, it’s telling me that there are 14,156 posts on Slaw and 18,449 comments. This is analytics in its simplest form. It takes a body of data (Slaw posts) and uses metrics to illuminate a context-specific aspect of that data that wouldn’t otherwise be obvious. These numbers are also useful. As a reader, they give me a rough sense of Slaw’s size: it’s not The Globe and Mail, but it’s definitely not a small personal blog. Or, if I’m on Slaw’s staff and I think it might be a good idea to try to improve the average reader engagement each post gets, I can perform a calculation on those two metrics to find out how many comments a post gets on average (about 1.3 comments per post) and use that as a baseline measurement.
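That baseline calculation is nothing more than arithmetic. As a sketch (using the figures above, which will have changed by the time you read this):

```python
# Descriptive analytics in its simplest form: counting and averaging.
posts = 14_156     # total Slaw posts, per "The Count"
comments = 18_449  # total comments, per "The Count"

# Average reader engagement per post, usable as a baseline measurement.
engagement = comments / posts
print(f"{engagement:.1f} comments per post")  # prints "1.3 comments per post"
```

Nothing here requires modeling or machine learning — the insight comes from putting two simple metrics next to each other in a meaningful context.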

In the Canadian legal tech space, analytics companies tend to fall into one of two categories: descriptive analytics and predictive analytics. This distinction is important. Generally speaking, descriptive analytics tell you what has happened and predictive analytics tell you what is likely to happen in the future. (Though if a predictive analytics tool claims to tell you what definitely will happen in the future, watch out.) Without getting into too much technical detail, predictive analytics uses algorithms, modeling, and machine learning to arrive at an answer. Descriptive analytics may make use of these tools and often does, but it can also arrive at answers by taking advantage of very simple mathematical methods, like counting or averaging.
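The distinction can be illustrated with a toy example. The motion outcomes below are entirely made up, and treating a historical frequency as a probability is the crudest possible predictive step — real predictive tools use far more sophisticated models — but it shows where description ends and prediction begins:

```python
# Hypothetical outcomes of five past summary judgment motions (made-up data).
outcomes = ["granted", "dismissed", "granted", "granted", "dismissed"]

# Descriptive analytics: counting tells you what has happened.
granted = outcomes.count("granted")
print(f"{granted} of {len(outcomes)} motions were granted")  # 3 of 5

# A naive predictive step: extrapolate the historical frequency into a
# rough estimate of what is *likely* to happen on a future motion.
print(f"Estimated chance of success: {granted / len(outcomes):.0%}")  # 60%
```

The first number describes the past; the second makes a (hedged, probabilistic) claim about the future. A tool that turned that 60% into a guarantee would be the kind to watch out for.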

Omar Ha-Redeye recently published an article on Slaw arguing that big data is needed for accurate legal AI and predictive analytics, and that Canada just doesn’t have enough legal data for good predictions because we simply don’t have enough published decisions. He’s not wrong that when it comes to prediction, you certainly want to have access to the largest data set possible, and that generally speaking, predictive accuracy tends to increase as the size of the data set increases. However, I strongly disagree that “big data” is necessary in order to tackle any and all predictive analytics problems in the legal sphere.

While there isn’t a strict, universally agreed upon definition of “big data”, I think most data professionals would agree that the 1.5 million published decisions that are currently hosted on CanLII don’t qualify. (As a point of reference, there are about 5.5 billion Google searches performed every day. Tracking and analyzing that dataset is definitely a problem that falls squarely in the realm of big data.) In fact, even the current body of American case law, which is far more extensive, probably wouldn’t qualify as big data.

But predictive analytics and machine learning projects vary wildly in scale and scope. It goes without saying that there are many predictive analytics problems in law that would require far more than 1.5 million decisions in order to make strong predictions. But there are also situations where we know in advance that certain elements are highly predictive of certain outcomes, and in a case like that, even a small data set can provide good predictive guidance. There’s a huge difference between a hypothetical product that could accurately predict the outcome of any case just by analysing the text of past written decisions (this is likely impossible at the current moment) and a predictive analytics product that provides an answer to a well-defined question with only two possible outcomes. For instance, Blue J Legal is a legal tech tool that uses machine learning, but it doesn’t claim to answer any possible question anyone might ever have about tax law. Instead, it’s currently tackling specific classification problems, like whether someone is a contractor or an employee.
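To make the contractor-versus-employee example concrete, here is a deliberately tiny sketch of a two-outcome classification problem. Everything in it is hypothetical — the features, the past cases, and the nearest-neighbour method are illustrative stand-ins, not a description of how Blue J Legal or any real product actually works:

```python
# Toy binary classifier: contractor vs. employee (all data hypothetical).
# Each past case is a tuple of yes/no factors known to be predictive:
# (controls_own_schedule, owns_tools, works_for_multiple_clients) -> outcome
past_cases = [
    ((1, 1, 1), "contractor"),
    ((1, 1, 0), "contractor"),
    ((0, 0, 1), "employee"),
    ((0, 1, 0), "employee"),
    ((0, 0, 0), "employee"),
]

def predict(new_case):
    """Return the outcome of the most similar past case (1-nearest-neighbour)."""
    def distance(a, b):
        return sum(x != y for x, y in zip(a, b))
    _, outcome = min(past_cases, key=lambda c: distance(c[0], new_case))
    return outcome

print(predict((1, 1, 1)))  # prints "contractor"
print(predict((0, 0, 1)))  # prints "employee"
```

The point is the scale: when the question is narrow and the predictive factors are known in advance, even five data points can yield a sensible answer, whereas predicting “the outcome of any case” from raw decision text would demand orders of magnitude more data.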

All this being said, most analytics, both in general and specifically in the legal space, are actually descriptive and not predictive. Descriptive analytics can usefully describe and categorize a dataset without making a probabilistic claim. Though not “predictive” in the technical sense, this method can still be used to see patterns in historical data in order to make smart, educated inferences about outcomes.

Here’s a basic example from real estate. When researching home prices, many people use reporting systems like Realtor.ca to do a quick search of sorted and structured real estate data. Realtor.ca cannot predict the price of a specific house; however, this does not mean that the information is not useful. In fact, it’s a widely used tool, and very rarely does someone sell or purchase a home without checking databases like this (or getting their realtor to do it). Although a real estate database can’t opine on the selling price of your home, it can provide you with data that allows you to price a home appropriately given the market conditions and the attributes of the home in question.

A lot of legal tech companies are trying to create systems similar to this for the law. Legal research traditionally involves manually reviewing case law to find relevant precedents. But rather than each individual lawyer performing the same research over and over in a time-limited, necessarily incomplete manner on a huge body of case law, several companies in the legal tech space have decided that there has to be a better, more complete way of performing this analysis.

Some Canadian legal tech companies like Rangefindr, and my company, Loom Analytics, provide purely historical case law metrics using descriptive analytics, while others such as Blue J Legal have chosen to build predictive machine learning models. In all cases it’s important to note that these products are built using the exact same body of data that has been used in legal research for decades, as lawyers are highly unlikely to be able to quote a decision in court that hasn’t been published. One thing that case law analytics brings to the table is the ability to see metrics based on the full available data set instead of what can be found and reviewed during the limited amount of research time available for most litigation.

Lawyers know that some cases carry more weight than others, and legal tech companies know this too. In a precedent-based legal system, a single data point may be all that you need, and in general, the goal of legal analytics tools is to help lawyers find the data points that are most relevant and meaningful.

Though sensationalistic articles about robot lawyers might try to convince you otherwise, legal analytics tools don’t eliminate the need for case law research. They exist to make it easier for living, breathing legal professionals to do their job, in the highly specific contexts in which they operate, in as informed a way as possible. And for that, you don’t need big data. You just need good data.
