Whether you're focused on free speech or on moderation, understanding biases in LLMs - and in the case of this project, biases in LLM-judges - is critical. Against this backdrop, xlr8harder's Speechmap project is a very important initiative.
The Speechmap project comes with a public repository containing all questions, responses, and LLM-judge analyses. Our Speechmap Explorer project leverages this highly valuable dataset for several use cases. We currently expect the project to span several months; progress will be logged on this page.
New Dataset
We re-submitted most responses to other LLM judges (two mistral-small models, qwen3-next-80b-a3b, gemma3-27b, and deepseek-v3.2) to compare their classifications against the original gpt-4o assessments. All judges agreed 88% of the time. We manually reviewed and/or annotated the conflicting classifications.
The data has been indexed slightly differently: some columns have been added and others removed. The resulting datasets, small enough to load reasonably quickly, have been uploaded to HuggingFace. Note, however, that our datasets typically lag a couple of months behind Speechmap; refer to the original GitHub repo for the full Speechmap dataset.
- 2.4k questions: speechmap-questions
- 369k responses: speechmap-responses
- 2.07m LLM-judge assessments: speechmap-assessments. This dataset combines the original gpt-4o assessments from the Speechmap project with assessments by mistral-small-3.1-2503 (local, 8-bit), mistral-small-3.2-2506 (local, 8-bit), gemma3-27b-it (Google API), deepseek-v3.2 (DeepSeek API), and qwen3-next-80B-A3B-instruct (local, 8-bit), plus manual annotations (see the query sketch below).
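As a quick illustration of how the assessments can be queried, here is a minimal TypeScript sketch using DuckDB's Node bindings and its httpfs extension to read the parquet file straight from HuggingFace and recompute the share of items on which all LLM judges agree. The URL, the column names (question_id, model, judge, compliance) and the 'manual' judge label are assumptions made for the example, not the actual schema.

```typescript
// Minimal sketch: recompute the fraction of (question, model) pairs on which
// all LLM judges agree, reading the parquet file directly from HuggingFace.
// The URL, column names and 'manual' label are assumptions, not the real schema.
import duckdb from "duckdb";

// Hypothetical location of the assessments parquet file on HuggingFace.
const ASSESSMENTS_URL =
  "https://huggingface.co/datasets/<org>/speechmap-assessments/resolve/main/assessments.parquet";

const db = new duckdb.Database(":memory:");

db.exec("INSTALL httpfs; LOAD httpfs;", (err) => {
  if (err) throw err;
  db.all(
    `WITH per_item AS (
       SELECT question_id, model,
              COUNT(DISTINCT compliance) = 1 AS unanimous
       FROM read_parquet('${ASSESSMENTS_URL}')
       WHERE judge <> 'manual'            -- keep LLM judges only (assumed value)
       GROUP BY question_id, model
     )
     SELECT AVG(CASE WHEN unanimous THEN 1 ELSE 0 END) AS agreement_rate
     FROM per_item`,
    (err2, rows) => {
      if (err2) throw err2;
      console.log(rows); // expected to land close to the ~88% mentioned above
    }
  );
});
```

The same approach works for the questions and responses files; since DuckDB pushes column and filter selections down to the remote parquet reader, only the data actually needed is downloaded.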
Note that the data in the original llm-compliance repo covers model outputs that may be subject to individual LLM licenses. Annotations and classifications produced by LLM judges are published under the same licenses as the original LLMs. Manual annotations and classifications are published under the CC-BY 4.0 license.
We currently plan to publish all question-response pairs that were manually annotated or validated unanimously by the LLM judges, in order to train classifiers. We expect to format the question-response pairs the way the Minos v1 classifier does, but please reach out if you have a better idea.
We also plan to provide a full write-up of our data work covering the key lessons learnt regarding judge prompts, judge models, judge-model hosting options, the questions that models really struggle with, and more. We will continue to re-classify the entire dataset with different methods and different classification schemes.
TypeScript App
An open-source, interactive TypeScript app was also published to explore the dataset and compare differences in response assessments. This tool helps visualize how different "judge" models classify the same LLM-generated responses, providing insight into inter-rater reliability and model behavior.
Core features:
- Compare Any Two Judges: Select any two LLM judges from the dataset to compare their assessments side-by-side.
- Filter by Theme: Narrow down the analysis to specific topics or domains by filtering by question theme.
- Sankey Chart: Visualize the reclassification flow, showing how assessments from Judge 1 are categorized by Judge 2.
- Transition Matrix (Heatmap): Get a clear, at-a-glance overview of agreement and disagreement between the two selected judges (see the query sketch after this list).
- Drill-Down to Details: Click on any chart element to inspect the specific items, including the original question, the LLM's response, and the detailed analysis from both judges.
- Manual Annotation Helpers: deactivated features that facilitate manual annotation; they can be reactivated or adapted from the ItemList component (src/components/itemList.tsx).
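To make the Transition Matrix and Sankey views more concrete, here is a hedged sketch of the kind of cross-tabulation they can be built from, assuming the local DuckDB database described below exposes an assessments table with question_id, model, judge and compliance columns (illustrative names, not necessarily the app's actual schema).

```typescript
// Sketch of a judge-vs-judge cross-tabulation: the data behind a transition
// matrix (heatmap) or Sankey chart. Table and column names are assumptions.
import duckdb from "duckdb";

const db = new duckdb.Database("speechmap.duckdb"); // local DB built at setup time

function transitionMatrix(judge1: string, judge2: string): void {
  db.all(
    `SELECT a1.compliance AS judge1_label,
            a2.compliance AS judge2_label,
            COUNT(*)      AS n
     FROM assessments a1
     JOIN assessments a2
       ON a1.question_id = a2.question_id AND a1.model = a2.model
     WHERE a1.judge = ? AND a2.judge = ?
     GROUP BY 1, 2
     ORDER BY n DESC`,
    judge1,
    judge2,
    (err, rows) => {
      if (err) throw err;
      console.table(rows); // each row is one cell of the matrix / one Sankey flow
    }
  );
}

// Example: how does deepseek-v3.2 re-label items originally assessed by gpt-4o?
transitionMatrix("gpt-4o", "deepseek-v3.2");
```

A theme filter like the one in the app would amount to one extra join on the questions table, and drilling down into a cell to re-running the query with the selected pair of labels as an additional predicate.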
The app is deployed as a HuggingFace Space, but it can also be run locally; see the installation instructions in the README.md file. During installation, the three parquet files covering the entire dataset are fetched from HuggingFace and a DuckDB database is built at the root of the project. Using Docker, the initial setup takes about one minute.
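For reference, this one-off build step boils down to a few CREATE TABLE ... AS SELECT statements over the downloaded parquet files; the file and table names below are placeholders, and the README documents the actual procedure.

```typescript
// Sketch of the one-off setup step: turn the three downloaded parquet files
// into a persistent DuckDB database at the project root.
// File and table names are illustrative; the README documents the real setup.
import duckdb from "duckdb";

const db = new duckdb.Database("speechmap.duckdb");

db.exec(
  `CREATE OR REPLACE TABLE questions   AS SELECT * FROM read_parquet('questions.parquet');
   CREATE OR REPLACE TABLE responses   AS SELECT * FROM read_parquet('responses.parquet');
   CREATE OR REPLACE TABLE assessments AS SELECT * FROM read_parquet('assessments.parquet');`,
  (err) => {
    if (err) throw err;
    console.log("speechmap.duckdb ready");
  }
);
```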
An initial attempt to switch the backend from Node.js to duckdb-wasm, so the application would run entirely in the browser, was not conclusive. The key issue was the persistence of the database, which proved more complex than initially expected. In any case, a HuggingFace Space using Docker appeared to be a more user-friendly approach for those who just want to browse the data.
Classifiers
The final part of the project will involve training different types of long-context classifiers on this data.
This would be a perfect opportunity to explore in depth the recent innovations in the field, both for long-context encoders and for small decoders trained for classification tasks.
If the results are supportive, we believe that such cheap classifiers would be highly valuable for researchers and users: there is no such thing as an unbiased LLM, so if people are to use LLMs for personal or professional purposes, they ought to make sure that the biases align with their expectations. As a matter of fact, government bodies in the US have already shared guidelines regarding this type of “bespoke” alignment (see the US action plan or this executive order). But what is missing today are cheap, fast tools that measure subtle nuances of alignment while guaranteeing transparency. The solutions offered by OpenAI or Anthropic are simply not good enough.
We believe that such tools will become very important as more organisations and administrations want to assess model behaviour before selecting a vendor. Some people may want heavily moderated models along all axes; others may want no moderation at all. Irrespective of our opinion on the matter, we’ll consider this project a success if our classifiers allow people to make an informed choice of model rather than being forced to blindly adopt an off-the-shelf solution.
Acknowledgments
The Speechmap Explorer project only exists because the Speechmap project exists. Make sure to check out the Speechmap project website, where you can browse the original dataset in great detail. Please support that project if you can. Thanks!
