

Whether you're focused on free speech or on moderation, understanding biases in LLMs - and in the case of this project, biases in LLM-judges - is critical. Against this backdrop, xlr8harder's Speechmap project is a very important initiative.
The Speechmap project comes with a public repository including all questions, responses and LLM-judge analyses. Our Speechmap Explorer project leverages this highly valuable dataset for different use cases. We currently expect the project to span several months; progress will be logged on this page.
New Dataset
We re-submitted most responses to another LLM judge (Mistral Small) to compare classifications against the original GPT-4o assessments. We are in the process of annotating conflicting classifications manually.
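For readers curious about the mechanics, here is a minimal sketch of how a stored response can be re-submitted to a second judge through an OpenAI-compatible chat-completions endpoint. The endpoint URL, judge prompt and output schema below are placeholders for illustration only; the project's actual judge prompt and parsing logic live in the llm-compliance repo.

```typescript
// Sketch: re-judge a stored question/response pair with a second model via an
// OpenAI-compatible chat-completions endpoint. Endpoint, prompt and schema are
// placeholders, not the project's actual configuration.
const JUDGE_URL = "https://api.example.com/v1/chat/completions"; // hypothetical endpoint
const JUDGE_MODEL = "mistral-small-3.1-24b-instruct-2503";

interface Assessment {
  compliance: string; // e.g. COMPLETE / EVASIVE / DENIAL, per the judge prompt's taxonomy
  analysis: string;
}

async function judgeResponse(question: string, response: string): Promise<Assessment> {
  const res = await fetch(JUDGE_URL, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.JUDGE_API_KEY}`,
    },
    body: JSON.stringify({
      model: JUDGE_MODEL,
      messages: [
        // Placeholder prompt: the real judge prompt is defined in the llm-compliance repo.
        {
          role: "system",
          content:
            "Classify the response's compliance with the request and justify briefly. Reply in JSON with fields 'compliance' and 'analysis'.",
        },
        { role: "user", content: `QUESTION:\n${question}\n\nRESPONSE:\n${response}` },
      ],
      response_format: { type: "json_object" },
    }),
  });
  const data = await res.json();
  return JSON.parse(data.choices[0].message.content) as Assessment;
}
```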
Data has been indexed slightly differently: some columns have been added and others removed. Refer to the original GitHub repo for the full dataset. The resulting datasets, small enough to load reasonably quickly, have been uploaded to HuggingFace (a query sketch follows the list below):
- 2.4k questions: speechmap-questions
- 350k responses: speechmap-responses
- 875k LLM-judge assessments: speechmap-assessments. The assessments dataset combines the original LLM assessments from the llm-compliance repo (`gpt-4o-2024-11-20`), a run with `mistral-small-3.1-24b-instruct-2503`, manual annotations and a run with `mistral-small-3.2-24b-instruct-2506`. More manual annotations and assessments from additional open-source models will be added shortly.
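Because the files are plain parquet, they can be queried straight from HuggingFace with DuckDB's `hf://` protocol before downloading anything. The sketch below (using the `duckdb` Node package) joins the three datasets; the `<org>` placeholder in the paths and the column names (`question_id`, `response_id`, `compliance`, ...) are assumptions, so check the dataset cards for the actual schema.

```typescript
import duckdb from "duckdb";

// Sketch: query the three parquet datasets directly from HuggingFace.
// "<org>" and the column names below are placeholders, not the real schema.
// DuckDB auto-loads the httpfs extension for hf:// paths (or run INSTALL/LOAD httpfs first).
const db = new duckdb.Database(":memory:");

const sql = `
  SELECT q.theme, q.question, r.model AS responding_model,
         a.judge_model, a.compliance
  FROM read_parquet('hf://datasets/<org>/speechmap-questions/**/*.parquet')   AS q
  JOIN read_parquet('hf://datasets/<org>/speechmap-responses/**/*.parquet')   AS r ON r.question_id = q.question_id
  JOIN read_parquet('hf://datasets/<org>/speechmap-assessments/**/*.parquet') AS a ON a.response_id = r.response_id
  LIMIT 10;
`;

db.all(sql, (err, rows) => {
  if (err) throw err;
  console.table(rows);
});
```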
Note that data in the original llm-compliance repo covers model outputs that may be subject to individual LLM licenses. Mistral Small 3.1 and Mistral Small 3.2 are licensed under Apache 2.0 and, accordingly, the classification dataset is published under a permissive license. We will publish our manual annotations under a permissive license too.
We currently plan to publish all permissively-licensed, manually-annotated question-response pairs, which can then be used to train classifiers. We expect to format the question-response pairs the same way as the Minos v1 classifier, but please reach out if you have a better idea.
We also plan to publish a full write-up of our data work covering the key lessons learnt about judge prompts, judge models, judge-model hosting options, the questions models really struggle with, and more. We will continue to re-classify the entire dataset with different methods and classification schemes.
TypeScript App
An open-source, interactive TypeScript app was also published to explore the dataset and compare differences in response assessments. This tool helps visualize how different "judge" models classify the same LLM-generated responses, providing deep insights into inter-rater reliability and model behavior.
Core features:
- Compare Any Two Judges: Select any two LLM judges from the dataset to compare their assessments side-by-side.
- Filter by Theme: Narrow down the analysis to specific topics or domains by filtering by question theme.
- Sankey Chart: Visualize the reclassification flow, showing how assessments from Judge 1 are categorized by Judge 2.
- Transition Matrix (Heatmap): Get a clear, at-a-glance overview of agreement and disagreement between the two selected judges (a query sketch follows this list).
- Drill-Down to Details: Click on any chart element to inspect the specific items, including the original question, the LLM's response, and the detailed analysis from both judges.
- Manual Annotation Helpers: Deactivated features that facilitate manual annotation; they can be reactivated or adapted from the ItemList file (src/components/itemList.tsx).
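As an illustration of what the Sankey and heatmap views aggregate, here is a rough sketch of the underlying query, written against a local DuckDB database with an `assessments` table (built during setup, described just below). The database file name, table name and column names (`response_id`, `judge_model`, `compliance`) are assumptions for illustration, not the app's actual internals.

```typescript
import duckdb from "duckdb";

// Sketch: count how every judge-1 label maps to a judge-2 label for the same response.
// Each resulting row is one heatmap cell / one Sankey link. Names are illustrative.
const db = new duckdb.Database("speechmap.duckdb"); // placeholder file name

const JUDGE_A = "gpt-4o-2024-11-20";
const JUDGE_B = "mistral-small-3.1-24b-instruct-2503";

const transitionSql = `
  SELECT a.compliance AS judge_1_label,
         b.compliance AS judge_2_label,
         COUNT(*)     AS n
  FROM assessments AS a
  JOIN assessments AS b USING (response_id)
  WHERE a.judge_model = ?
    AND b.judge_model = ?
  GROUP BY ALL
  ORDER BY judge_1_label, judge_2_label;
`;

db.all(transitionSql, JUDGE_A, JUDGE_B, (err, rows) => {
  if (err) throw err;
  console.table(rows); // one row per (judge-1 label, judge-2 label) pair
});
```

The Sankey links and the heatmap cells are simply two renderings of this same (judge_1_label, judge_2_label, count) aggregation.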
The app is deployed on HuggingFace but can also be run locally; see the installation instructions in the README.md file. During setup, the three parquet files covering the entire dataset are fetched from HuggingFace and a DuckDB database is built at the root of the project. With Docker, the initial set-up takes about one minute.
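For reference, the setup step boils down to something like the following sketch (assuming the `duckdb` Node package; the actual file and table names used by the setup script may differ).

```typescript
import duckdb from "duckdb";

// Sketch: materialise the downloaded parquet files into a persistent DuckDB
// database at the project root. Paths and table names are placeholders.
const db = new duckdb.Database("speechmap.duckdb");

const tables: Record<string, string> = {
  questions: "data/speechmap-questions.parquet",
  responses: "data/speechmap-responses.parquet",
  assessments: "data/speechmap-assessments.parquet",
};

for (const [table, file] of Object.entries(tables)) {
  db.run(`CREATE TABLE IF NOT EXISTS ${table} AS SELECT * FROM read_parquet('${file}')`, (err) => {
    if (err) throw err;
    console.log(`Loaded ${table} from ${file}`);
  });
}
```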
An initial plan to switch the backend from Node.js to duckdb-wasm, so that the application would run entirely in the browser, was not conclusive. The key issue was database persistence, which proved more complex than initially expected. In any case, a HuggingFace Space using Docker appeared to be a more user-friendly approach for those who just want to browse the data.
Classifiers
The final part of the project will involve training different types of long-context classifiers on this data.
This would be a perfect opportunity to explore recent innovations in the field in depth, covering both long-context encoders and small decoders trained for classification tasks.
Should the results be supportive, we believe that such cheap classifiers would be highly valuable for researchers and users: there is no such thing as an unbiased LLM, so if people are to use LLMs for personal or professional purposes, they ought to make sure the biases align with their expectations. In fact, government bodies in the US have already shared guidelines regarding this type of “bespoke” alignment (see the US action plan or this executive order). What is missing today are cheap, fast tools that measure subtle nuances of alignment while guaranteeing transparency.
We believe that such tools will become very important as more organisations and administrations want to assess model behaviour before selecting a vendor. Some people may want heavily moderated models along all axes; others may want no moderation at all. Irrespective of our opinion on the matter, we will consider this project a success if our classifiers allow people to make an informed choice of model rather than being forced to blindly adopt an off-the-shelf solution.
Acknowledgments
The Speechmap Explorer project only exists because the Speechmap project exists. Make sure to check out the Speechmap project website, where you can browse the original dataset in great detail. Please support that project if you can. Thanks!
