From GPT-4 to Llama 3: LMSYS Chatbot Arena Ranks Top LLMs

Himanshi Singh 01 May, 2024 • 5 min read

Introduction

Every week, new and more advanced Large Language Models (LLMs) are released, each claiming to be better than the last. But how can we keep up with all these new developments? The answer is the LMSYS Chatbot Arena.

The LMSYS Chatbot Arena is an innovative platform created by the Large Model Systems Organization, a group made up of students and teachers from UC Berkeley, UCSD, and CMU. This platform makes it easy to compare and evaluate different LLMs by allowing users to test and rate them. It’s a place where anyone interested in these models can come to find out about the latest releases and see how they stack up against each other.

LMSYS Leaderboard

This leaderboard ranks LLMs by fitting a Bradley-Terry model to human pairwise comparisons and displaying the resulting scores on an Elo scale. As of April 26, 2024, the leaderboard includes 91 different models and has collected more than 800,000 human pairwise comparisons. Models are also ranked within categories such as coding and long user queries, and the leaderboard is continuously updated.

Top 10 LLMs

The top and trending models based on Arena Elo Ratings are:

GPT-4-Turbo by OpenAI
GPT-4-1106-preview by OpenAI
Claude 3 Opus by Anthropic
Gemini 1.5 Pro API-0409-Preview by Google
GPT-4-0125-preview by OpenAI
Bard (Gemini Pro) by Google
Llama 3 70b Instruct by Meta
Claude 3 Sonnet by Anthropic
Command R+ by Cohere
GPT-4-0314 by OpenAI

With four of the ten models listed above, OpenAI is clearly leading the race so far.

If, like me, you are wondering why some model names include the term “preview”, here is the answer: a “preview” is a version of a large language model (LLM) made available for testing, feedback, or experimental use before its official release. This stage lets developers and users explore the model’s capabilities, identify issues, and provide feedback that can be incorporated into later refinements. Essentially, it is like a beta version of software: mostly functional and showcasing new features or improvements, but possibly still carrying bugs or limitations that need addressing before a full, stable release.

The rankings take into account the 95% confidence interval when determining a model’s ranking, and models with fewer than 500 votes are removed from the rankings.
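To make the vote threshold and the confidence interval concrete, here is a minimal Python sketch. It assumes a percentile bootstrap, which is one common way to get such an interval; the article does not say which procedure the leaderboard uses. For simplicity it bootstraps a plain win rate rather than a rating. The model names, vote counts, and outcomes are hypothetical.

```python
import random

MIN_VOTES = 500  # threshold quoted in the article


def filter_by_votes(vote_counts, min_votes=MIN_VOTES):
    """Keep only models that have at least `min_votes` votes."""
    return {m: n for m, n in vote_counts.items() if n >= min_votes}


def bootstrap_ci(outcomes, n_rounds=1000, level=0.95, seed=0):
    """Percentile bootstrap interval for the mean of `outcomes`.

    `outcomes` is a list of 1 (win) and 0 (loss). Each round resamples
    the list with replacement and records the resampled win rate.
    """
    rng = random.Random(seed)
    n = len(outcomes)
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_rounds)
    )
    lo_idx = round((1 - level) / 2 * n_rounds)
    hi_idx = round((1 + level) / 2 * n_rounds) - 1
    return means[lo_idx], means[hi_idx]


# Hypothetical data, for illustration only.
counts = {"model-a": 1200, "model-b": 350, "model-c": 500}
kept = filter_by_votes(counts)  # model-b falls below the threshold

outcomes = [1] * 330 + [0] * 270  # 600 battles, 55% wins
low, high = bootstrap_ci(outcomes)
print(kept)
print(f"observed win rate 0.55, 95% interval ({low:.3f}, {high:.3f})")
```

The fewer votes a model has, the wider this interval becomes, which is one reason a minimum vote count is a sensible requirement before ranking a model.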

Open Source vs Closed Source LLMs

You might have heard that Llama 3 is the best open source Large Language Model (LLM) so far. However, if you check the overall rankings, GPT-4 Turbo is at the top. Why is that? It’s because the rankings include both open source and closed source LLMs.

Look at the last column of the leaderboard—it shows the type of license each LLM has. This is important because it divides the models into two main groups: open source and closed source.

Open Source LLMs

Open source LLMs make their model weights, and often their code, publicly available. Anyone can inspect, run, and build on the model, which fosters a collaborative development environment.

Freely Available: These models have permissive licenses like Apache 2.0 or MIT, which allow broad use, including commercial use (e.g., Mixtral-8x22b-Instruct, Zephyr-ORPO, Starling-LM-7B-beta, OpenChat-3.5, Zephyr-7b-beta).
Limited Use: Some openly released models have restrictions attached to their licenses. These restrictions may prohibit commercial use (e.g., non-commercial Creative Commons licenses), require derivative works to be shared under the same terms (e.g., copyleft licenses), or add custom usage conditions. Examples include Command R+ and Llama 3.

Closed Source LLMs

Closed source LLMs do not make their weights or code publicly available and require permission or licensing to use, usually through an API or a hosted product. They are typically developed by commercial entities (e.g., OpenAI’s GPT-4 series, Google’s Gemini series, Anthropic’s Claude series).

In short, open source LLMs offer transparency and foster collaboration, while closed-source LLMs prioritize control and potentially deliver a more polished user experience.

How Does LMSYS Arena Work?

The LMSYS platform works by collecting user dialogue data to evaluate large language models (LLMs). Users can compare two different LLMs side-by-side on a given task and then vote on which LLM provided a better response. The LMSYS platform uses these votes to rank the different LLMs.

Here’s a step-by-step breakdown of how LMSYS works:

Go to LMSYS platform > ⚔️ Arena (side-by-side) and select any two different LLMs that you want to compare.

Then provide a task or prompt for the two LLMs to complete. This can be anything a human can evaluate, such as writing a poem, translating a passage, or answering a question. Here I asked the models: “Write a 700 words article on Top Open Source LLMs.”

You’ll see two answers from different LLMs side by side. Pick the one you prefer. If the two answers are equally good, you can pick “Tie”; if you don’t like either, you can pick “Both are bad”.

The LMSYS platform then adds your vote to the pool of comparisons it uses to rank the LLMs. The rankings are computed with the Bradley-Terry model, a statistical model for ranking items based on pairwise comparisons.
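The last step can be sketched in Python. The code below fits Bradley-Terry strengths to a small set of votes using a standard iterative (minorization-maximization) update, then maps them to an Elo-like scale. It is a sketch under stated assumptions, not the platform's actual code: the vote data is made up, a tie counts as half a win for each side, “Both are bad” votes are not modelled, and the base rating of 1000 is an arbitrary anchor.

```python
import math
from collections import defaultdict


def fit_bradley_terry(battles, n_iters=200):
    """Fit Bradley-Terry strengths with the classic MM iteration.

    `battles` is a list of (model_a, model_b, result), where result is
    "a", "b", or "tie". Counting a tie as half a win for each side is
    an assumption of this sketch. Every model needs at least one
    (partial) win and the two models in a battle must differ.
    """
    wins = defaultdict(float)   # total (fractional) wins per model
    games = defaultdict(float)  # number of battles per unordered pair
    for a, b, result in battles:
        score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[result]
        wins[a] += score_a
        wins[b] += 1.0 - score_a
        games[frozenset((a, b))] += 1.0

    models = sorted(wins)
    strength = {m: 1.0 for m in models}
    for _ in range(n_iters):
        new = {}
        for m in models:
            denom = 0.0
            for pair, n in games.items():
                if m in pair:
                    (other,) = pair - {m}
                    denom += n / (strength[m] + strength[other])
            new[m] = wins[m] / denom
        # Strengths are only defined up to a common factor, so keep
        # their geometric mean fixed at 1.
        log_mean = sum(math.log(v) for v in new.values()) / len(new)
        strength = {m: v / math.exp(log_mean) for m, v in new.items()}
    return strength


def to_elo_scale(strength, base=1000.0):
    """Map strengths to an Elo-like scale (400 points per factor of 10)."""
    return {m: base + 400.0 * math.log10(s) for m, s in strength.items()}


# Hypothetical votes between three placeholder models.
battles = (
    [("A", "B", "a")] * 7 + [("A", "B", "b")] * 3
    + [("B", "C", "a")] * 6 + [("B", "C", "b")] * 4
    + [("B", "C", "tie")] * 2
    + [("A", "C", "a")] * 8 + [("A", "C", "b")] * 2
)
ratings = to_elo_scale(fit_bradley_terry(battles))
for model, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {rating:.0f}")
```

Because the fit uses all votes at once, the result does not depend on the order in which the votes arrived.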

LMSYS Leaderboard Evaluation System

The LMSYS leaderboard draws on two closely related methods to rate Large Language Models (LLMs): the Elo rating system and the Bradley-Terry model.

Elo Rating System: This system, which is also used in chess, gives each LLM a score based on its performance. If an LLM wins a match, it gains points, and if it loses, it loses points. The difference in points between two LLMs shows which one is stronger and how likely it is to win future matches.
Bradley-Terry Model: This is the statistical model behind Elo-style ratings. Instead of adjusting scores one match at a time, it estimates a fixed strength for every LLM from all the pairwise comparisons at once, so the result does not depend on the order in which the votes arrived.

In the LMSYS Chatbot Arena, LLMs are like players in a game, where they interact with users and compete against each other. Under Elo-style scoring, each LLM starts with a base score, and this score changes based on whether it wins or loses matches. Winning against a stronger LLM gives more points, and losing to a weaker one takes away more points. This way, the ratings track the relative strengths of the LLMs.
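The scoring behaviour described above follows from the standard Elo formulas, sketched here in Python. The K-factor of 32 and the example ratings are assumptions chosen for illustration, not values taken from the leaderboard.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a, rating_b, score_a, k=32.0):
    """Return the new (rating_a, rating_b) after one match.

    `score_a` is 1.0 if A won, 0.0 if A lost, and 0.5 for a tie.
    `k` controls how far a single result moves the ratings; 32 is a
    common textbook choice and an assumption here.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Hypothetical match: an underdog rated 1000 faces a favorite rated 1200.
underdog, favorite = 1000.0, 1200.0

# Case 1: the underdog wins (an upset).
upset_underdog, upset_favorite = elo_update(underdog, favorite, 1.0)
# Case 2: the favorite wins (the expected result).
calm_underdog, calm_favorite = elo_update(underdog, favorite, 0.0)

print(f"underdog gains {upset_underdog - underdog:.1f} points for an upset")
print(f"favorite gains {calm_favorite - favorite:.1f} points for an expected win")
```

The upset moves the ratings much further than the expected result does, because the update is proportional to the gap between the actual outcome and the predicted win probability.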

The Elo system is great for keeping track of how LLMs perform over time, helping to understand which models are doing well and predicting how they might do in the future. This makes it a very useful tool for seeing how new and existing models stack up against each other in the ever-changing world of AI development.

Interested in reading more about the evaluation process? Check out their paper: https://arxiv.org/abs/2403.04132

Conclusion

I hope this article has helped you understand how the LMSYS leaderboard works and where you can keep track of the latest developments in large language models.

The LMSYS Chatbot Arena lets users rank the models through head-to-head votes and applies statistical models such as Bradley-Terry to turn those votes into scores. This makes it a great place to see how these models actually perform. Understanding these models better helps everyone use them more effectively in real-life situations.

If you know of any other resources that can help stay up-to-date in the field of Generative AI, please share them in the comments section below. Your input can help us all keep pace with this rapidly evolving technology!
