Vals Legal AI Research Report Published.. ‘The Big Players Didn’t Play’

I’m unable to  give you a link to access this as they appear to have put it behind a security wall – so here’s the intro to an analysis by Artificial Lawyer. One of the key findings it suggest is that Chat GPT if used properly is no better or worse than the legal publishers versions.

The latest Vals Legal AI Report (VLAIR) has been published and focuses on legal research. It’s certainly got attention because key players didn’t take part and AI clearly ‘beat’ human lawyers. AL takes a look at what it all means.

First, what is Vals? It’s a US business working on the evaluation of AI systems. The focus here on legal research is just one of many projects they run. They brought in a group of legal AI companies, some human lawyers, and also tested against a general model. In this case the evaluated products were Alexi, Counsel Stack, Midpage, plus ChatGPT. At least one major vendor did take part, but then didn’t want its results made public (more on that below).

How does Vals compare human lawyer work with that of AI? They explained that they ‘establish a baseline measure of the quality of work produced by the average lawyer, unaided by generative AI. To achieve this, we partnered with a US law firm who supplied lawyers experienced in conducting legal research for client matters. The lawyers were asked to answer the Dataset Questions based on the exact same instructions and context provided to the AI products.’

And in terms of how it was all carried out, the ‘response collection took place over the first three weeks of July 2025. All questions were submitted to each product via API as a zero-shot prompt.’ So, companies – and general models – may have improved since then, as three months is a long time in genAI. Also, the test was not about how people used AI tools and the application layers that come with them, it was really laser-focused on a straight question/answer approach, without follow-ups.

The overall finding is shown below, i.e. all the AI systems, including the general model, beat the human lawyers on 200 legal research question ‘distributed across a range of question types typically encountered in private practice’. They used three main criteria:

‘- Accuracy: whether the response is substantively correct with no incorrect elements (50% of average weighted score)

– Authoritativeness: whether the response is supported by citations to relevant and valid primary and/or secondary law sources (40% of average weighted score)

– Appropriateness: whether the response is easy to understand and could be immediately shareable with colleagues or clients (10% of average weighted score)’.

The AIs won on all three.

AL View

‘AI Is Better Than Lawyers At Legal Research’

OK, this is the big one that has gained the most attention. But, what does it mean? The ‘[test group of lawyers] were allowed to use all research resources and tools at their disposal provided they were not generative AI-based’. I.e. the human lawyers were also using digital databases packed full of legal information.

So, is this a surprise? On one level, no. Legal data stores are vast, complex, and can absorb huge amounts of time if lawyers are able to keep digging and digging. Given genAI’s strong language understanding capabilities, and the ability to tap this vast sea of data very quickly, it isn’t surprising that human lawyers were beaten here.

Also, when it comes to things like summarization, it would again – because of genAI’s language understanding skills – have been surprising if a test group of human lawyers had clearly beaten the AI.

If the lawyers had been strongly incentivised to spend considerable time on each and every question, and perhaps had a supervising partner there to keep saying ‘not good enough, go back and keep looking’, or ‘no, rephrase that, add more context, do it again’, and provide other high-quality feedback, then one would have expected the test group of lawyers to do a lot better. But, they were not in that situation, it appears.

OK, so, genAI tools are better at legal research tasks. What does this tell us? Well, it tells us that AI tools are good at handling tons of language. The questions, see below, didn’t ask the AI tools to then take that information and go to court with it, stand in front of a judge, or build a litigation strategy with the client. So, we are still far from ‘the end of lawyers’. Although, it may make some law librarians a bit nervous.

Moreover, is anyone really that sad that AI tools can provide very good legal research results? If AI tools can’t do well on this type of database-focused task, then they wouldn’t be much use for anything else.

Overall, A) this is not a big surprise, and B) this is welcome news. And C) we can add the caveats that the lawyers were not heavily incentivised to keep working on these questions, nor does a distinct research output equate to sophisticated legal skills in court.

That said, if you are a client reading this, and your law firm is charging you a lot for legal research, then you may want to have a conversation with them.

‘The Big Players Didn’t Play’

As noted, AL knows that at least one ‘big player’ took part and then decided it didn’t want its results shown – which is their prerogative. However, this opens up a range of issues.

  • Are studies valid if only a small number of vendors take part, especially if the big brands are not there? The short answer is that having the big legal data players in the study would have made it much more valuable to everyone, because those are the ones that most lawyers use. Simple as that.
  • Why didn’t the big players play ball? As with previous studies (by a range of groups), some companies have grumbled that the gap between the test and the publication is so great that the scores are not meaningful any longer, as three months is a long time in legal AI. That is a fair argument – although, the other vendors are all in the same boat. They could also argue that the tasks were better fitted to some companies and not their one. But, if you are in ‘legal research’ and the tasks are all about ‘legal research’, then surely this is a test you can do well at? One final reason is simply that the bigger brands have more to lose by comparing against other companies, even if they only fall short by 1% on a metric; and their top teams do not want that kind of message to go public….even if the end scores are only part of the value the big platforms provide. I.e. the application layer and all that goes with a sophisticated platform counts for a lot, and one metric in one study should not be seen as a final proof.
  • Should the big players play ball? Well….yes. At the end of the day evals are only really useful if they test the tools that most lawyers mostly use – as well as the new kids on the block. Much as the ones tested are doing great work, we all know which tools law firms tend to use for legal research needs.

Do Tests Like This Give Us A Full Picture?

The reality is that a lawyer’s work is not just ‘find an answer to this specific question’. The work of a lawyer is within an ever-complexifying world, with multiple layers, client interactions, battling the other side’s lawyers, and much more. The ability to use AI to answer a legal research query is not the end of lawyers, not even close – not unless a client’s main reason for hiring a law firm is to do legal research for them.

Read the full analysis

Vals Legal AI Research Eval – The Aftermath

 

https://www.vals.ai/industry-reports/vlair-10-14-25