Vals Publishes Results of Legal AI Benchmark Study

Artificial Lawyer Reports

Vals AI, the US-based company providing genAI performance testing, has published its first study of how several legal tech companies responded to a series of tests set for them by major law firms, including Reed Smith and Fisher Phillips.

The companies whose results were shared include Harvey, Thomson Reuters’ CoCounsel, Vecflow, and vLex. The human lawyers who acted as a fleshy comparator were provided by ALSP Cognia. (Note: there are AL TV Product Walk Throughs of Vecflow and vLex to get some more context – here.)

Vals’ proprietary ‘auto-evaluation framework platform’ was then used ‘to produce a blind assessment of the submitted responses against the model answers’.

Results Overview

And here is an overview of the Vals results, which AL is sharing here verbatim:

‘The seven tasks evaluated in this study were Data Extraction, Document Q&A, Document Summarization, Redlining, Transcript Analysis, Chronology Generation, and EDGAR Research, representing a range of functions commonly performed by legal professionals. The evaluated tools were CoCounsel (from Thomson Reuters), Vincent AI (from vLex), Harvey Assistant (from Harvey), and Oliver (from Vecflow). Lexis+AI (from LexisNexis) was initially evaluated but withdrew from the sections studied in this report.

The percentages below represent each tool’s accuracy or performance scores based on predefined evaluation criteria for each legal task. Higher percentages indicate stronger performance relative to other AI tools and the Lawyer Baseline.

Some key takeaways include:

  • Harvey opted into six out of seven tasks. They received the top scores of the participating AI tools on five tasks and placed second on one task. In four tasks, they outperformed the Lawyer Baseline.
  • CoCounsel is the only other vendor whose AI tool received a top score. It consistently ranked among the top-performing tools for the four evaluated tasks, with scores ranging from 73.2% to 89.6%.
  • The Lawyer Baseline outperformed the AI tools on two tasks and matched the best-performing tool on one task. In the four remaining tasks, at least one AI tool surpassed the Lawyer Baseline.

Beyond these headline findings, a more detailed analysis of each tool’s performance reveals additional insights into their relative strengths, limitations, and areas for improvement.

Harvey Assistant matched or outperformed the Lawyer Baseline in five tasks and outperformed the other AI tools in four of the tasks evaluated. Harvey Assistant also received two of the three highest scores across all tasks evaluated in the study, for Document Q&A (94.8%) and Chronology Generation (80.2%—matching the Lawyer Baseline).

Thomson Reuters submitted its CoCounsel 2.0 product in four task areas of the study. CoCounsel received high scores on all four tasks evaluated, particularly for Document Q&A (89.6%—the third highest score overall in the study), and received the top score for Document Summarization (77.2%). For the four tasks evaluated, it achieved an average score of 79.5%. CoCounsel surpassed the Lawyer Baseline in those four tasks by more than 10 points.

The AI tools collectively surpassed the Lawyer Baseline on four tasks related to document analysis, information retrieval, and data extraction. The AI tools matched the Lawyer Baseline on one (Chronology Generation). Interestingly, none of the AI tools beat the Lawyer Baseline on EDGAR Research, potentially signaling that these challenging research tasks remain an area in which legal AI tools still fall short of meeting law firm expectations.

Redlining (79.7%) was the only other task in which the Lawyer Baseline outperformed the AI tools. The Lawyer Baseline’s single highest score was for Chronology Generation (80.2%). Given the current capabilities of AI, lawyers may still be the best at handling these tasks.

The Lawyer Baseline scored reasonably high on Data Extraction (71.1%) and Document Q&A (70.1%), but some AI tools still managed to surpass it. All of the AI tools surpassed the Lawyer Baseline for Document Summarization and Transcript Analysis. Document Q&A was the highest-scoring task overall, with an average score of 80.2%. These are the tasks where legal generative AI tools show the most potential.

EDGAR Research was one of the most challenging tasks and had a Lawyer Baseline of 70.1%. In this category, Oliver was the only contender, at 55.2%. Increased performance on EDGAR Research—a task that involves multiple research steps and iterative decision-making—may require notable accuracy and reliability improvements in the nascent field of “AI agents” and “agentic workflows.” For details on AI challenges, see the EDGAR Research section.

Overall, this study’s results support the conclusion that these legal AI tools have value for lawyers and law firms, although there remains room for improvement in both how we evaluate these tools and their performance.’

Note: LexisNexis, which had engaged with the project team, ‘withdrew’. In a statement to Artificial Lawyer, the company said: ‘The timing didn’t work for LexisNexis to participate in the Vals AI legal report. We deployed a significant product upgrade to Lexis+ AI, with our Protégé personalized AI assistant, which rendered the benchmarking analysis out-of-date and inaccurate.’

Read the full report: Vals Publishes Results of First Legal AI Benchmark Study