This is a preview of a project I’ve been working on in preparation for a couple of presentations: the analysis I’ll run any time a new model of interest is released or a frontier model is updated. It’s part of a larger benchmark that will help grade the effectiveness of an individual language model for a particular task.

Lab Notes

GPT-4o was released yesterday and I’ve had a chance to compare it to the output of the tests run earlier in the week. GPT-4o jumped immediately to the top of the list. The full synthetic discussion of the board can be found below.

A brief note on the methodology I’m following here: each model is provided with an identical system prompt that forces chain-of-thought reasoning and assigns it the role of a helpful legal assistant. Each model then receives a prompt that simulates a person verbally dictating the known facts and issues to discuss, followed by a request to generate a discussion of those issues. The resulting memo, which will be released along with all testing data, protocols, and logs, was then evaluated against a common set of criteria with the aid of a sample answer.
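To make the protocol concrete, here is a minimal sketch of what a single pass of that harness might look like. This is not the actual test script: the system prompt wording, the `fact_pattern_dictation.txt` file, the output paths, and the model list are illustrative placeholders, and the Anthropic, Groq, and Perplexity models would go through their own clients rather than the OpenAI client shown here.

```python
# Minimal sketch of one 0-shot pass, assuming the OpenAI Python client.
# Prompt text, file names, and the model list are illustrative placeholders.
import os
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_PROMPT = (
    "You are a helpful legal assistant. Think through the facts step by step "
    "before writing your discussion."  # stands in for the real CoT prompt
)

# Stands in for the dictated facts-and-issues prompt used in the real runs.
with open("fact_pattern_dictation.txt") as f:
    USER_PROMPT = f.read()

def run_trial(model: str) -> str:
    """One model, one shot: identical prompts, no follow-up turns."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": USER_PROMPT},
        ],
        temperature=0,  # keep runs comparable across models
    )
    return response.choices[0].message.content

os.makedirs("outputs", exist_ok=True)
for model in ["gpt-4o", "gpt-4", "gpt-4-turbo"]:
    with open(f"outputs/{model}.md", "w") as f:
        f.write(run_trial(model))
```

Each saved memo is then graded against the common criteria and sample answer in a separate step.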

One limitation to keep in mind: this benchmark assumes NO iteration or back-and-forth, so it’s very likely that any of the top five would be nearly indistinguishable after a few minutes of conversation.

I read through each result as it came in and as the analysis was being done and recorded, and I generally agree strongly with the tiers. Unsurprisingly, these tiers mesh with what we’re seeing in the market in terms of pricing between model families. Llama 70B performing as well as it does is extremely interesting given how easily and effectively it can run locally on hardware that doesn’t involve fans you have to listen to.

Tomorrow is Google I/O, and I’ll be adding whatever is unveiled there to the rankings soon.

Below is a synthetic discussion of the rankings and how things shake out (Claude did most of the work, coming from the same thread that did the analysis).

-jb

| Rank | Model | Tier | Highlights | Lowlights |
| --- | --- | --- | --- | --- |
| 1 | GPT-4o | S-tier | Comprehensive analysis, strong legal reasoning, practical insights, clear writing style | Could explore more nuanced legal standing and unconventional strategies |
| 2 | GPT-4 | S-tier | Well-structured analysis, attention to detail, strategic insights, persuasive writing | May benefit from more emphasis on practical considerations |
| 3 | GPT-4 Turbo | S-tier | Thorough and nuanced analysis, creative solutions, practical considerations, engaging writing | Could further explore broader policy implications |
| 4 | Claude 3 Opus | S-tier | Well-organized analysis, solid understanding of negligence principles, clear writing | May benefit from more in-depth exploration of strategic options |
| 5 | CodeLlama 70B (Groq) | A-tier | Competent analysis, innovative legal theories, consideration of policy implications | Could provide more detailed analysis of damages and liability |
| 6 | Claude 3 Haiku | A-tier | Concise and effective summary, good grasp of legal principles, practical considerations | Limited depth due to concise format |
| 7 | Claude 3 Sonnet | A-tier | Well-structured analysis, balanced assessment of case strengths and weaknesses | May benefit from more detailed exploration of damages and strategic options |
| 8 | Perplexity Llama 3 Sonar Large | B-tier | Solid analysis of negligence and liability, practical advice on damages | Could provide more in-depth analysis and strategic insights |
| 9 | Mixtral 8x7B | C-tier | Competent analysis covering essential aspects of the case | Lacks depth and strategic thinking compared to higher-ranked models |

Tier Breakdown:

  • S-tier (Highly recommended): GPT-4o, GPT-4, GPT-4 Turbo, Claude 3 Opus
  • A-tier (Recommended with some limitations): CodeLlama 70B (Groq), Claude 3 Haiku, Claude 3 Sonnet
  • B-tier (Recommended with caution): Perplexity Llama 3 Sonar Large
  • C-tier (Not recommended for this use case): Mixtral 8x7B

Discussion


Based on the evaluation of the nine AI models in this 0-shot test, it’s clear that the GPT options (GPT-4o, GPT-4, and GPT-4 Turbo) along with Claude 3 Opus stand out as the top performers, constituting the S-tier. These models demonstrate strong legal reasoning capabilities, provide comprehensive and nuanced analyses, and offer practical insights and strategic considerations. Their outputs closely resemble the work of experienced attorneys, making them highly appealing as potential copilots for legal professionals. However, it’s important to note that this represents a single test case, and models may behave differently as users engage with them over extended interactions.

The A-tier models, including CodeLlama 70B (Groq) and the Claude 3 variations (Haiku and Sonnet), also perform well, delivering competent analyses and demonstrating a good grasp of the relevant legal principles. However, they may have some limitations in terms of depth, nuance, or strategic thinking compared to the S-tier models. If I were considering an AI assistant for legal work, I might explore these models further, but with the understanding that they may require more oversight and guidance compared to the S-tier options.

Perplexity Llama 3 Sonar Large, falling into the B-tier, provides a solid analysis of the key legal issues but lacks the depth and strategic insights of the higher-ranked models. While it shows promise, I would approach this model with caution and be prepared to provide more hands-on direction and review.

Finally, Mixtral 8x7B, in the C-tier, delivers a competent analysis that covers the essential aspects of the case but lacks the depth and strategic thinking demonstrated by the other models. For this particular use case, I would probably pass on Mixtral 8x7B for now, as it may not provide the level of support and insights I would be looking for in a complex legal matter.

When comparing the top-performing models, GPT-4o and Claude 3 Opus, some subtle differences emerge in their approach and output. GPT-4o demonstrates a slightly more comprehensive and nuanced understanding of the legal principles, breaking down each element of negligence and applying them to the facts of the case in a more granular manner. Its language and reasoning also come across as more sophisticated and legally fluent, closely mimicking the writing style of an experienced attorney.

Claude 3 Opus, while still providing a high-quality analysis, may not delve quite as deeply into the nuances of the legal issues. Its output, while well-organized and professional, may not match the level of legal fluency demonstrated by GPT-4o.

However, it’s important to recognize that both GPT-4o and Claude 3 Opus demonstrate impressive capabilities in legal analysis and communication, and the differences between them are relatively subtle. In a real-world scenario, I would feel confident working with either model as an assistant, with GPT-4o having a slight edge in terms of depth, nuance, and legal fluency.

As AI continues to advance and new models emerge, it will be crucial for legal professionals to stay informed about the capabilities and limitations of these tools. Regular benchmarking exercises, like the one conducted in this project, can help attorneys make informed decisions about which AI models to potentially incorporate into their practice and how to effectively leverage them to enhance their work and better serve their clients. However, it’s essential to remember that AI is a tool to augment, not replace, human legal expertise, and any insights provided by these models should be carefully reviewed and validated by experienced legal professionals.