Let me start by thanking the Stanford research team for their efforts in aligning the AI community around a common set of standards and benchmarks that will further the development of trusted and safe AI. We reviewed the updated paper, “Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools,” with interest here at Thomson Reuters and wholeheartedly agree with the spirit of the research and the ambition of the team running the study.

Last year, we released AI-Assisted Research in Westlaw Precision to help our customers do their legal research faster and better. I’d like to share a little more about our approach. Both before releasing it and since its release, we have tested it rigorously with hundreds of real-world legal research questions, with two lawyers grading each result and a third, more senior lawyer resolving any disagreements in grading.

We also tested with customers prior to release, first explaining that AI solutions should be used as an accelerant rather than a replacement for their own research. In that testing, customer feedback was extraordinarily positive, even when customers found inaccuracies themselves. They would tell us things like, “this saves hours of time” or “it’s a game-changer.” We’ve never had a more exciting update to Westlaw.

We are very supportive of efforts to test and benchmark solutions like this, and we support the intent of the Stanford research team in conducting its recent study of RAG-based solutions for legal research. We were quite surprised, however, by the claims of significant hallucination issues in AI-Assisted Research. The results in this paper differ dramatically from our own testing and from the feedback of our customers.

We are committed to working with the researchers to find out more, but in my experience, one reason the study may have found higher rates of inaccuracy than our internal testing is that the research included question types we very rarely or never see in AI-Assisted Research. A key lesson here is that the user experience in these products could be more explicit about the specific limitations of the system.

We focused first on testing AI-Assisted Research on the types of real-world legal research questions our customers bring to Westlaw every day. Our thorough internal testing of AI-Assisted Research shows an accuracy rate of approximately 90% based on how our customers use it, and we’ve been very clear with customers that the product can produce inaccuracies. In addition to our daily conversations with customers, this notice has appeared on the homepage of AI-Assisted Research since its initial launch:

AI-Assisted Research uses large language models and can occasionally produce inaccuracies, so it should always be used as part of a research process in connection with additional research to fully understand the nuance of the issues and further improve accuracy.

We also advise our customers, both in the product and in training, to use AI-Assisted Research to accelerate thorough research, but not to use it as a replacement for thorough research.

When we discussed inaccuracies with customers prior to launch, they saw enormous value in the feature even when it produced occasional inaccuracies, as long as it could be used in combination with other tools in Westlaw to reach an accurate result quickly. Even when inaccuracies appear, they can typically be identified quickly by using the references listed in the answer and tools like KeyCite or statute annotations.

While our internal testing has been and continues to be rigorous, we see a clear need for third-party evaluation of real-world use of these systems. The development of reliable, robust benchmarks is critical for the responsible adoption of AI. Benchmarking is an increasingly challenging (and resource-intensive) area of research, especially in expert domains like law. To that end, Thomson Reuters would very much like to partner with the Stanford research team to explore creating a consortium of stakeholders that would develop and maintain state-of-the-art benchmarks across a range of legal use cases. Talks are early, but we are hopeful we can find a way to work together on this important opportunity.

Our focus has been, and continues to be, on our customers. We encourage our customers to compare the accuracy of our products against others on the market. We are confident that Westlaw AI-Assisted Research provides the most complete, well-grounded, and accurate responses. Many customers have run side-by-side tests and chosen Westlaw. We continue to work closely with our customers to improve and evolve our solutions.

Thomson Reuters stands behind the accuracy and efficacy of its products, and we look forward to working with academia and other industry partners to help customers understand both the benefits and limitations of these emerging technologies.

This is a guest post from Mike Dahn, head of Westlaw Product Management, Thomson Reuters.