
Data Provenance Explorer
Date : 2023-10-25
Introduction
The Data Provenance Initiative is a large-scale audit of AI datasets used to train large language models. As a first step, we've traced 1800+ popular, text-to-text finetuning datasets from origin to creation, cataloging their data sources, licenses, creators, and other metadata, for researchers to explore using this tool. The purpose of this work is to improve transparency, documentation, and informed use of datasets in AI.
Access web app here
Recently on :
Artificial Intelligence
WEB - 2025-11-13
Measuring political bias in Claude
Anthropic gives insights into their evaluation methods to measure political bias in models.
WEB - 2025-10-09
Defining and evaluating political bias in LLMs
OpenAI created a political bias evaluation that mirrors real-world usage to stress-test their models’ ability to remain objecti...
WEB - 2025-07-23
Preventing Woke AI In Federal Government
Citing concerns that ideological agendas like Diversity, Equity, and Inclusion (DEI) are compromising accuracy, this executive ...
WEB - 2025-07-10
America’s AI Action Plan
To win the global race for technological dominance, the US outlined a bold national strategy for unleashing innovation, buildin...
WEB - 2024-12-30
Fine-tune ModernBERT for text classification using synthetic data
David Berenstein explains how to finetune a ModernBERT model for text classification on a synthetic dataset generated from argi...