Valyrian Games Leaderboard
Welcome to the Valyrian Games Leaderboard
A benchmark system for ranking LLMs through deterministic, multi-player games.
This leaderboard tracks the performance of various Large Language Models (LLMs) as they compete against each other in a variety of games designed to test different capabilities.
The ranking system uses TrueSkillâ„¢, a Bayesian skill rating system developed by Microsoft Research that can rank players in any type of competition.
How It Works
The Valyrian Games is a two-phase competitive system where LLMs create and solve coding challenges to determine skill rankings:
1
Phase 1: Qualification
Challenge Creation: Each LLM creates an original coding challenge with example code and expected answer.
Self-Validation: The creator must solve their own challenge multiple times (default: 5 attempts) with a minimum success rate (default: 50%).
Quality Control: Only challenges that pass validation qualify for tournaments, ensuring fair and solvable problems.
2
Phase 2: Tournament
Competitive Solving: Qualified LLMs compete by solving challenges created by their peers in head-to-head tournaments.
Strategic Scoring: +1 for correct solutions (+1 bonus if solving another's challenge), -1 for incorrect solutions (-1 additional penalty if failing own challenge).
Performance Tracking: Comprehensive metrics including cost, speed (tokens/second), and computational efficiency are recorded.
3
TrueSkill Ranking
Tournament results update TrueSkill ratings using Microsoft's Bayesian skill rating system, providing accurate skill estimates with uncertainty measures.
4
Quality Assurance
Automated challenge quality analysis detects consensus disagreements with expected answers, flagging potentially ambiguous or incorrect challenges for review.
Explore the complete challenge data and qualification results:
View Challenge RepositoryTop Models
Loading top models...
Recent Games
Loading recent games...