Anybody re-reading this article will see that we initially said in the headline that Dahn implied in his statement below that the Stanford report was wrong.
I’ve subsequently been contacted by TR’s communications department asking me to revisit his statement, as they believe the use of the word “wrong” was, well, wrong.
It’s all a bit of a quibble, but I’m happy to use his words instead, which say they understand there are issues with the product at this stage of development.
For me, though, this does highlight another issue made clear in the statement: clients are using a product that has a long way to go before it is error-free, if it ever will be. This is AI, and essentially we are training tech to think for itself, which by its very nature implies an element of error in perpetuity.
If we were talking about buying a car, getting on a plane, and so on, the concept of placing products on the market that don’t work would be unthinkable.
Why should the law be different?
Obviously I have no idea what the deal between Westlaw and its clients is; underfunded rags like this aren’t really top of their agenda, and I’m only contacted when the algorithm might lead possible future clients to the word “wrong”.
Clients may be testing the products for free, or they may not. I’m just not in that loop.
There are elements of truth and reality in both what Stanford says and what Dahn says, as well as a touch of obfuscation from both parties, which in turn makes for a nice little parable about AI development.
This, as they say, is just the beginning.
Dahn Says
Our Commitment to Our Customers
Let me start by thanking the Stanford research team for their efforts in aligning the AI community around a common set of standards and benchmarks that will further the development of trusted and safe AI. We reviewed the updated paper Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools with interest here at Thomson Reuters and wholeheartedly agree with the spirit of the research and the ambition of the team running the study.
Last year, we released AI-Assisted Research in Westlaw Precision to help our customers do their legal research much faster and better. I’d like to share with you a little more on our approach. Prior to releasing it, and since its release, we test it rigorously with hundreds of real-world legal research questions, where two lawyers graded each result, and a third, more senior lawyer, resolved any disagreements in grading.
We also tested with customers prior to release – first explaining that AI solutions should be used as an accelerant, rather than a replacement for their own research. In our testing with customers their feedback was extraordinarily positive even when they found inaccuracies themselves. They would tell us things like, “this saves hours of time” or “it’s a game-changer.” We’ve never had a more exciting update to Westlaw.
We are very supportive of efforts to test and benchmark solutions like this, and we’re supportive of the intent of the Stanford research team in conducting its recent study of RAG-based solutions for legal research, but we were quite surprised when we saw the claims of significant issues with hallucinations with AI-Assisted Research. In fact, the results from this paper differ dramatically from our own testing and the feedback of our customers.
We are committed to working with the researchers to find out more, but in my experience, one reason the study may have found higher rates of inaccuracy than we have in our internal testing is because the research included question types we very rarely or never see in AI-Assisted Research. A key lesson learned here is that user experiences in these products could be more explicit about specific limitations of the system.
We focused first on testing AI-Assisted Research on the types of real-world legal research questions our customers are bringing to Westlaw every day. Our thorough internal testing of AI-Assisted Research shows an accuracy rate of approximately 90% based on how our customers use it, and we’ve been very clear with customers that the product can produce inaccuracies. As well as speaking with our customers each day, this notice appears on the homepage of AI-Assisted Research and has since its initial launch:
AI-Assisted Research uses large language models and can occasionally produce inaccuracies, so it should always be used as part of a research process in connection with additional research to fully understand the nuance of the issues and further improve accuracy.
We also advise our customers, both in the product and in training, to use AI-Assisted Research to accelerate thorough research, but not to use it as a replacement for thorough research.
In discussing inaccuracies with customers prior to launch, they saw enormous value in the feature even when it produced occasional inaccuracies, as long as it could be used in combination with other tools in Westlaw to get to an accurate result quickly. Even when inaccuracies appear, using the references listed in the answer and tools like KeyCite or statutes annotations, inaccurate answers can typically be quickly identified.
While our internal testing has been and continues to be rigorous, we see a clear need for third-party evaluation of real-world use of these systems. The development of reliable, robust benchmarks is critical for the responsible adoption of AI. Benchmarking is an increasingly challenging (and resource-intensive) area of research, especially in expert domains like law. To that end, Thomson Reuters would very much like to partner with the Stanford research team to explore the creation of a consortium of stakeholders to work together to develop and maintain state-of-the-art benchmarks across a range of legal use cases. Talks are early but we are hopeful we can find a way to work together on this important opportunity.
Our focus has been, and continues to be, on our customers. We encourage our customers to compare the accuracy of our products versus others on the market. We are confident that Westlaw AI-Assisted Research provides the most complete, well-grounded, and accurate responses. We’ve had so many customers run side-by-side tests and choose Westlaw. We continue to work closely with our customers to improve and evolve our solutions.
Thomson Reuters stands behind the accuracy and efficacy of its products and we look forward to working with academia and other industry partners to help customers understand both the benefits and limitations of these emerging technologies.
This is a guest post from Mike Dahn, head of Westlaw Product Management, Thomson Reuters.
Artificial Lawyer Writes
Thomson Reuters has contradicted the findings of a recent Stanford HAI study into its genAI research capability within Westlaw Precision, and stated ‘our thorough internal testing of AI-Assisted Research shows an accuracy rate of approximately 90% based on how our customers use it’.
The claim, which is part of a piece written by Mike Dahn, head of Westlaw Product Management, was made in response to the Stanford University HAI study in May that found the results from its Westlaw genAI product were highly problematic (see below).
The results of the Stanford study show an accuracy rate for the Westlaw genAI product of only 42% and an overall hallucination rate of 33% – see here.
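It is worth noting that ‘accuracy’ and ‘hallucination rate’ are separate measures: in grading schemes of this kind an answer can be incomplete or a refusal without fabricating anything, so the two figures need not sum to 100%. The short sketch below, in Python, shows how two such rates could be computed from graded responses; the three-way labels and the sample data are illustrative assumptions on our part, not the Stanford team’s actual grading code or data.

from collections import Counter

# Hypothetical per-response grades (illustrative only):
#   "correct"      – accurate and properly grounded
#   "hallucinated" – contains a false statement or misgrounded citation
#   "incomplete"   – a refusal or partial answer: wrong, but not fabricated
grades = [
    "correct", "hallucinated", "incomplete", "correct",
    "hallucinated", "correct", "incomplete", "correct",
]

counts = Counter(grades)
total = len(grades)

accuracy = counts["correct"] / total                  # share fully correct
hallucination_rate = counts["hallucinated"] / total   # share with fabrications

print(f"accuracy:           {accuracy:.0%}")
print(f"hallucination rate: {hallucination_rate:.0%}")
# Because "incomplete" is its own bucket, the two rates do not sum to
# 100% – which is how a 42% accuracy figure can coexist with a 33%
# hallucination rate.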

Meanwhile, as noted, TR says they are at ‘90%’ accuracy, and they avoid discussing in detail the appearance of hallucinations in research results. But they do then state in their response that they are open about errors taking place:
‘AI-Assisted Research uses large language models and can occasionally produce inaccuracies, so it should always be used as part of a research process in connection with additional research to fully understand the nuance of the issues and further improve accuracy.
‘We also advise our customers, both in the product and in training, to use AI-Assisted Research to accelerate thorough research, but not to use it as a replacement for thorough research.’
The company then goes on to thank the Stanford team and also to welcome the idea of benchmarking genAI legal solutions.
However, they do not really back down at all on the key message: they are accurate, no matter what Stanford has said. Their statement then underlines how thorough they have been in their testing:
‘Last year, we released AI-Assisted Research in Westlaw Precision to help our customers do their legal research much faster and better…Prior to releasing it, and since its release, we test it rigorously with hundreds of real-world legal research questions, where two lawyers graded each result, and a third, more senior lawyer, resolved any disagreements in grading.
‘We also tested with customers prior to release – first explaining that AI solutions should be used as an accelerant, rather than a replacement for their own research. In our testing with customers their feedback was extraordinarily positive even when they found inaccuracies themselves. They would tell us things like, “this saves hours of time” or “it’s a game-changer.” We’ve never had a more exciting update to Westlaw.’
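To picture the grading protocol TR describes above – two lawyers grade each result, with a more senior lawyer resolving disagreements – here is a minimal sketch in Python. The function and labels are hypothetical; this is not TR’s actual tooling, just an illustration of the adjudication rule, assuming a simple two-way grade per result.

def adjudicate(grade_a: str, grade_b: str, senior_grade: str) -> str:
    """Final grade for one research result: escalate only on disagreement."""
    if grade_a == grade_b:
        return grade_a       # the two graders agree: no escalation
    return senior_grade      # disagreement: the senior lawyer decides

# Hypothetical grades for three results: (lawyer A, lawyer B, senior).
results = [
    ("accurate", "accurate", "accurate"),
    ("accurate", "inaccurate", "accurate"),     # escalated to the senior
    ("inaccurate", "inaccurate", "inaccurate"),
]

final = [adjudicate(a, b, s) for a, b, s in results]
accuracy = final.count("accurate") / len(final)
print(f"accuracy over {len(final)} results: {accuracy:.0%}")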
Overall, the TR statement is something of a defiant response showing confidence in the product, while balancing this with the additional point that they know errors appear and that AI is ‘an accelerant, rather than a replacement’. However, they remain bullish on the accuracy level: 90% in TR’s tests vs 42% in the Stanford test, plus plenty of hallucinations.
What Does It All Mean?
Primarily, TR’s comments send the message to the market that, whatever Stanford is saying, the conclusions it drew are mistaken.
Not that TR had much choice here. How can a company that serves lawyers sell a product that is only ‘42% accurate’ and hallucinates a lot as well? But TR says it tests extensively with lawyers and the accuracy is actually very high. So, it IS good to use.
Go figure. And perhaps only you – the user of the product – can ever really get to the answer, as you are the one using it for real work. Although objective genAI benchmarks that the legal community agreed on would go a long way to helping, especially before firms buy this software.
This site has plenty of additional questions for TR and will be sending them today, hopefully to get a fuller response. And it’s worth mentioning that when AL asked for a statement in May about the Stanford test of Westlaw, this site was sent this general statement, plus a smaller comment that stemmed from that same statement.
P.S. One point many people have wondered about – ‘How much of this is basically Casetext?’ – has still not been answered directly. However, it seems fairly clear that before the deal with them TR didn’t have extensive genAI capabilities for case law research, and after the deal it did. But AL will keep asking TR for clarity on this.
Conclusion
It’s understandable that TR wants to draw a line under this and move on. The problem is that the Stanford study’s findings are just so radically far from TR’s own. There are still several questions to answer. AL will keep asking them.