Legal Technology Hub Article: Sources Matter, and Other Research Conundrums in the GenAI Era

Published on 2024-04-23 by Sarah Glassmeyer

Now that the initial rush of excitement over Generative AI and LLMs has passed, those digging into the efficacy and sustainability of these tools are looking at topics such as the provenance of the training data. The copyright status of the material seems to be the primary concern, with lawsuits pending against OpenAI[1] and Nvidia[2], among others. While the lawsuits on LLM training work their way through the courts, we wondered about something a little more clear-cut and perhaps even more basic.

External data has been used in legal technology products for years, long before the rise of LLMs. The legal research and docket analysis verticals are populated with repackaged and repurposed content that comes from a host of government, IGO, and NGO sources. Depending on the type of content and the creator or publisher, the accessibility and usability of the content vary wildly.

You may be surprised to learn that accessibility and usability concerns extend even to primary law. In the United States, at least, all law is assumed to be in the public domain. However, impediments to access include copyright, whether through incorrect assertions by governments or the embedding of public domain material in copyrighted content; the format of publication, including whether it is born digital or a print product; and potential corporate control, such as via exclusive publishing licenses.

Some of these concerns stem from subtle challenges arising from the way laws have traditionally been reported. For example, while the text of a court decision is public, the synopsis and headnotes are likely copyrighted, making it difficult for a secondary provider to safely use technology to scrape the law from a court reporter (see graphic below, where this is explored further).

These impediments to access have stopped some, but not all, legal tech creators from obtaining copies of a corpus to use in building their products. Primary law content has been obtained by various means, with varying levels of compliance with copyright law and Terms of Service. Some vendors have been willing to sell their data to potential competitors. Some companies paid to have books rekeyed by overseas contractors. Some have used OCR on digitized print sources. Some used data scraping techniques on both locked and open content.
