

Whether you're focused on free speech or on moderation, understanding biases in LLMs - and in the case of this project, biases in LLM-judges - is critical. Against this backdrop, xlr8harder's Speechmap project is a very important initiative.
The Speechmap project comes with a public repository including all questions, responses and LLM-judge analyses. Our Speechmap Explorer project leverages this highly valuable dataset for several use cases. We currently expect the project to span several months; progress will be logged on this page.
New Dataset
We re-submitted most responses to another LLM judge (Mistral Small) to compare its classifications against the original GPT-4o assessments. We are in the process of manually annotating conflicting classifications.
The data has been indexed slightly differently: some columns have been added and others removed. Refer to the original GitHub repo for the full dataset. The resulting datasets, small enough to load reasonably quickly, have been uploaded to HuggingFace:
- 2.4k questions: speechmap-questions
- 274k responses: speechmap-responses
- 510k LLM-judge assessments: speechmap-assessments. The assessment dataset combines the original assessments from `gpt-4o` and a new set from `mistral-small`. Manual annotations will be added shortly.
Note that the data in the original llm-compliance repo covers model outputs that may be subject to individual LLM licenses. Mistral Small 3.1 is licensed under Apache 2.0 and, accordingly, the classification dataset is published under a permissive license. We will publish our manual annotations under a permissive license too.
We currently plan to publish over 240k question-response pairs, assessed by Mistral-small-3.1-24b-instruct-2503 or manually annotated, which can be used to train classifiers. We expect to format the question-response pairs the way the Minos v1 classifier does, but please reach out if you have a better idea.
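As a rough illustration, a Minos-style input concatenates the question and response into a single string with chat-style role tags. The exact template below is an assumption; check the Minos v1 model card before relying on it:

```typescript
// Hypothetical formatter. The role-tag template is an assumption about the
// Minos v1 input format, not a confirmed specification.
interface QAPair {
  question: string;
  response: string;
}

function formatForClassifier({ question, response }: QAPair): string {
  return `<|user|>\n${question}\n<|assistant|>\n${response}`;
}

// Example usage:
console.log(
  formatForClassifier({
    question: "Write an essay arguing that ...",
    response: "I'm sorry, but I can't help with that.",
  })
);
```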
We also plan to provide a full write-up of our data work covering the key lessons learnt regarding judge prompts, judge models, and the questions models really struggle with. We envisage re-classifying the whole dataset with different methods and a different classification scheme.
TypeScript App
An open-source, interactive TypeScript app was also published to explore the dataset and compare differences in response assessments. This tool helps visualize how different "judge" models classify the same LLM-generated responses, providing insights into inter-rater reliability and model behavior.
Core features:
- Compare Any Two Judges: Select any two LLM judges from the dataset to compare their assessments side-by-side.
- Filter by Theme: Narrow down the analysis to specific topics or domains by filtering by question theme.
- Waterfall Chart: Visualize the reclassification flow, showing how assessments from Judge 1 are categorized by Judge 2.
- Transition Matrix (Heatmap): Get a clear, at-a-glance overview of agreement and disagreement between the two selected judges (a minimal computation sketch follows this list).
- Drill-Down to Details: Click on any chart element to inspect the specific items, including the original question, the LLM's response, and the detailed analysis from both judges.
- The self-assessment branch includes a feature to facilitate manual annotation (work in progress).
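For the curious, here is a minimal sketch of the transition-matrix computation behind the heatmap. The field names and labels are illustrative, not the app's actual schema:

```typescript
// Sketch of the transition-matrix computation: count how items labelled `a`
// by judge 1 were labelled `b` by judge 2. Field names (itemId, judge, label)
// and the label values are illustrative assumptions.
interface Assessment {
  itemId: string;
  judge: string;
  label: string; // e.g. COMPLETE, EVASIVE, DENIAL
}

function transitionMatrix(
  rows: Assessment[],
  judge1: string,
  judge2: string
): Map<string, Map<string, number>> {
  // Pair up each item's labels from the two selected judges.
  const byItem = new Map<string, { a?: string; b?: string }>();
  for (const r of rows) {
    const entry = byItem.get(r.itemId) ?? {};
    if (r.judge === judge1) entry.a = r.label;
    if (r.judge === judge2) entry.b = r.label;
    byItem.set(r.itemId, entry);
  }

  // Accumulate counts: matrix[judge1Label][judge2Label] += 1.
  const matrix = new Map<string, Map<string, number>>();
  for (const { a, b } of byItem.values()) {
    if (a === undefined || b === undefined) continue; // skip unmatched items
    const row = matrix.get(a) ?? new Map<string, number>();
    row.set(b, (row.get(b) ?? 0) + 1);
    matrix.set(a, row);
  }
  return matrix;
}
```

The diagonal of this matrix is the agreement count; everything off the diagonal feeds the waterfall chart's reclassification flow.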
See the installation instructions in the README.md file. Upon installation, the three parquet files covering the entire dataset are fetched from HuggingFace and a DuckDB database is built at the root of the project.
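The build step boils down to materializing the parquet files as DuckDB tables. A minimal sketch, assuming local file names matching the dataset list above (the actual build script lives in the repo):

```typescript
// Minimal sketch: materialize the three parquet files as tables in a DuckDB
// database persisted at the project root. File names are assumptions based
// on the dataset list above.
import duckdb from "duckdb";

const db = new duckdb.Database("speechmap.duckdb");
db.exec(
  `CREATE OR REPLACE TABLE questions   AS SELECT * FROM 'speechmap-questions.parquet';
   CREATE OR REPLACE TABLE responses   AS SELECT * FROM 'speechmap-responses.parquet';
   CREATE OR REPLACE TABLE assessments AS SELECT * FROM 'speechmap-assessments.parquet';`,
  (err) => {
    if (err) throw err;
    console.log("speechmap.duckdb built");
  }
);
```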
The next step is to switch from the current Node.js backend to duckdb-wasm, so that the application runs entirely in the browser. This should be faster than the current DuckDB setup and will involve a client-side data persistence strategy (likely using the Origin Private File System) to download and build the database only once, ensuring fast load times on subsequent visits. We would then be able to serve the app in a HuggingFace Space.
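A sketch of what that persistence strategy could look like, assuming duckdb-wasm's `AsyncDuckDB` and the OPFS file-handle API; the function name and the cache check are illustrative:

```typescript
// Sketch of OPFS caching: download each parquet file once, keep it in the
// Origin Private File System, and register it with duckdb-wasm on subsequent
// visits. `db` is an already-instantiated duckdb-wasm AsyncDuckDB.
import type { AsyncDuckDB } from "@duckdb/duckdb-wasm";

async function loadParquetCached(db: AsyncDuckDB, name: string, url: string) {
  const root = await navigator.storage.getDirectory();
  const handle = await root.getFileHandle(name, { create: true });

  let file = await handle.getFile();
  if (file.size === 0) {
    // First visit: fetch from HuggingFace and persist the bytes to OPFS.
    const bytes = await (await fetch(url)).arrayBuffer();
    const writable = await handle.createWritable();
    await writable.write(bytes);
    await writable.close();
    file = await handle.getFile();
  }

  // Register the cached bytes so SQL queries can reference `name` directly.
  await db.registerFileBuffer(name, new Uint8Array(await file.arrayBuffer()));
}
```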
Classifiers
The final part of the project will involve training different types of long-context classifiers on this data.
This will be a perfect opportunity to explore in depth the recent innovations in the field, both long-context encoders and small decoders trained for classification tasks. If the results are supportive, we believe such cheap classifiers would be highly valuable for researchers.
Acknowledgments
The Speechmap Explorer project only exists because the Speechmap project exists. Make sure to check out the Speechmap project website, where you can browse the original dataset in great detail. Please support that project if you can. Thanks!