French-Public Domain-Newspapers, or French-PD-Newspapers, is a large collection that aims to aggregate all French newspapers and periodicals in the public domain.
The collection was originally compiled by Pierre-Carl Langlais, on the basis of a large corpus curated by Benoît de Courson and Benjamin Azoulay for Gallicagram, in cooperation with OpenLLMFrance. Gallicagram is a leading cultural analytics project that provides word and n-gram search over very large cultural heritage datasets in French and other languages.
As of January 2024, the collection contains nearly three million unique newspaper and periodical editions (69,763,525,347 words) from the French National Library (Gallica). Each parquet file holds the full text of a few thousand editions selected at random and, when available, a few core metadata fields (Gallica id, title, author, word counts…). The metadata can easily be expanded using the BNF API.
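As a rough illustration of working with these per-edition records, the sketch below sums word counts per title over a handful of rows. The field names `title` and `word_count` are assumptions made for the example, not the confirmed parquet schema, and the sample rows are invented.

```python
from collections import Counter
from typing import Any, Iterable, Mapping

# Hypothetical column names: check the actual parquet schema before relying on them.
TITLE_FIELD = "title"
WORD_COUNT_FIELD = "word_count"


def words_per_title(records: Iterable[Mapping[str, Any]]) -> Counter:
    """Sum word counts per title, skipping records with missing metadata."""
    totals: Counter = Counter()
    for record in records:
        title = record.get(TITLE_FIELD)
        count = record.get(WORD_COUNT_FIELD)
        if title is None or count is None:
            continue  # metadata is only present when available
        totals[title] += int(count)
    return totals


# Toy rows standing in for one parquet file (invented data, for illustration only).
sample = [
    {"title": "Journal A", "word_count": 1200},
    {"title": "Journal A", "word_count": 800},
    {"title": "Journal B", "word_count": 500},
    {"title": None, "word_count": 300},
]

totals = words_per_title(sample)
print(totals.most_common())
```

With a real file, the rows could presumably be obtained with something like `pandas.read_parquet(path).to_dict("records")`, after adjusting the two field names to the columns actually present.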
This initial aggregation was made possible by the open data program of the French National Library and by the consolidation of public domain status for cultural heritage works in the EU under the 2019 Copyright Directive (art. 14).
The composition of the dataset adheres to the French criteria for the public domain: collective works published more than 70 years ago, and individual works whose author has been dead for more than 70 years. In accordance with the rule of the shorter term, the dataset is in the public domain everywhere.