Here are some sites that evaluate the performance of LLMs (large language models).
Hugging Face – Open LLM Leaderboard
The site tracks the performance of various LLMs in real time and publishes rankings. It offers interactive tools for comparing models across multiple benchmarks.
Access here: Open LLM Leaderboard on Hugging Face.

Artificial Analysis – LLM Leaderboard
It compares multiple LLMs based on metrics such as quality, cost, speed, and context window length. Specifically, it evaluates popular models such as GPT-4, Llama, and Mistral.
You can find out more here: Artificial Analysis LLM Leaderboard.
You can check these sites for performance comparisons and detailed evaluation results of the latest LLMs.
Chatbot Arena Leaderboard (UC Berkeley SkyLab and LMSYS)
The Chatbot Arena Leaderboard, hosted on Hugging Face by LMSYS, ranks chatbot models from user feedback on multi-turn conversations: users chat with two anonymous models side by side and vote for the better response. Developers can submit their models for evaluation, and the models are compared on qualities such as conversation coherence, engagement, and accuracy.
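Arena-style rankings of this kind are typically derived from pairwise votes with an Elo-style rating update. The Python sketch below is a minimal illustration of that idea; the model names and battle outcomes are invented for the example and are not taken from the leaderboard.

```python
# Minimal Elo-style rating sketch from pairwise votes (illustrative only;
# the model names and battle results below are invented for this example).

K = 32  # rating update step size


def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))


def update(ratings: dict, winner: str, loser: str) -> None:
    """Move the winner's rating up and the loser's down after one battle."""
    e_win = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += K * (1 - e_win)
    ratings[loser] -= K * (1 - e_win)


# Hypothetical battle log: (winner, loser) pairs from user votes
battles = [("model-a", "model-b"), ("model-a", "model-c"), ("model-b", "model-c")]

ratings = {"model-a": 1000.0, "model-b": 1000.0, "model-c": 1000.0}
for w, l in battles:
    update(ratings, w, l)

for name, rating in sorted(ratings.items(), key=lambda item: -item[1]):
    print(f"{name}: {rating:.1f}")
```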

LMArena
It publishes the latest research, analysis, tools, and community activities on evaluating and comparing large language models (LLMs), including:
- Arena Explorer: organizes the large volume of user conversation data collected by Chatbot Arena into hierarchical topics for analysis
- WebDev Arena: evaluates LLMs on hands-on web app development tasks
- Copilot Arena: compares LLM performance on code completion and editing
- RepoChat Arena: assesses how well LLMs understand GitHub repositories
- Arena-Hard: builds high-quality benchmark datasets

Japanese Language Comprehension Benchmark JGLUE
JGLUE is a benchmark dataset for evaluating Japanese natural language processing models, developed by Yahoo! JAPAN in collaboration with Waseda University. It focuses on Japanese language understanding and covers tasks such as question answering, natural language inference, and reading comprehension. JGLUE serves as a useful foundation for developing and improving Japanese NLP models, enabling evaluations that target language processing specific to Japanese.
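As a rough sketch, a JGLUE task can be inspected with the Hugging Face datasets library. The Hub repository name (shunk031/JGLUE), the JNLI config, and the field names below are assumptions made for illustration; check the Hub for the current location and loading instructions.

```python
# Rough sketch: loading a JGLUE task with the Hugging Face `datasets` library.
# The repository name, config name, and field names are assumptions for
# illustration; they are not specified in the article.
from datasets import load_dataset

# JNLI is JGLUE's natural language inference task (premise/hypothesis pairs).
# trust_remote_code is needed if the repository ships a loading script.
jnli = load_dataset("shunk031/JGLUE", name="JNLI", trust_remote_code=True)

print(jnli)  # available splits and their sizes

example = jnli["train"][0]
print(example["sentence1"])  # premise (Japanese)
print(example["sentence2"])  # hypothesis (Japanese)
print(example["label"])      # entailment / contradiction / neutral
```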

The Rakuda Ranking of Japanese AI
YuzuAI’s Rakuda Ranking (“rakuda” is Japanese for camel) evaluates the performance of Japanese large language models (LLMs). Each model answers a set of Japanese questions, and GPT-4 compares the answers pairwise to determine which model performs better.
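The pairwise, judge-based evaluation described here can be sketched roughly as follows. The prompt wording, judge model name, and OpenAI client usage are assumptions made for illustration; this is not the Rakuda project's actual implementation.

```python
# Rough sketch of pairwise LLM-as-judge comparison in the spirit of the
# Rakuda ranking. Prompt wording and judge model are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def judge(question: str, answer_a: str, answer_b: str) -> str:
    """Ask a judge model which of two answers to a Japanese question is better."""
    prompt = (
        f"質問: {question}\n\n"
        f"回答A: {answer_a}\n\n"
        f"回答B: {answer_b}\n\n"
        "どちらの回答がより優れていますか。'A' または 'B' のみで答えてください。"
    )
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumed judge model
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()


# Hypothetical usage: compare two models' answers to the same question
print(judge("日本の首都はどこですか。", "東京です。", "京都です。"))  # expected: "A"
```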
Hallucination performance evaluation
This benchmark evaluates how often large language models (LLMs) produce fabricated answers (hallucinations) to misleading questions about provided documents. Published by @lechmazur on X.
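As a loose sketch of the idea behind such a benchmark, the snippet below asks a model a question the document deliberately cannot answer and counts any invented answer as a hallucination. The prompt, model name, and scoring rule are assumptions for illustration, not the benchmark's actual methodology.

```python
# Loose sketch of a hallucination check: the question asks for information
# that is NOT in the document, so a faithful model should decline to answer.
# Prompt, model name, and scoring rule are illustrative assumptions.
from openai import OpenAI

client = OpenAI()


def is_hallucination(document: str, misleading_question: str) -> bool:
    """Return True if the model invents an answer instead of declining."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model under test
        messages=[
            {
                "role": "system",
                "content": (
                    "Answer only from the document. If the document does not "
                    "contain the answer, reply exactly: NOT IN DOCUMENT"
                ),
            },
            {
                "role": "user",
                "content": f"Document:\n{document}\n\nQuestion: {misleading_question}",
            },
        ],
    )
    answer = resp.choices[0].message.content.strip()
    return answer != "NOT IN DOCUMENT"  # any other reply counts as a hallucination


# Hypothetical usage: the document never mentions a founding year
doc = "Acme Corp. sells anvils and rocket skates by mail order."
print(is_hallucination(doc, "In what year was Acme Corp. founded?"))
```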
Related Articles – Performance Evaluation of General Purpose AI Assistants

