Projects — Papers Past newspaper open data
Wondering what you can do with digitised newspaper data? Have a look at the examples on this page for inspiration. The examples use a variety of newspaper data from different sources.
Creating specialised corpora from digitised historical newspaper archives
When confronted with huge collections of digitised material, it is very hard for an individual researcher or small team of researchers with interests in a specialised topic to apply contemporary text mining and other computational methods.
Joshua Black has developed an approach to help solve this problem. The core idea is that the text mining methods used to gain insight into a specialised topic can also be used to generate increasingly focused corpora.
Creating specialized corpora from digitized historical newspaper archives: An iterative bootstrapping approach — Joshua's academic paper
New Publication: Creating Specialised Corpora from Digitized Historical Newspaper Archives — Joshua's blog about his academic paper
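As a rough illustration of the core idea described above, the Python sketch below starts from a broad keyword search and then repeatedly uses a simple description of the current corpus to drop articles that do not fit it. It is a toy under stated assumptions, not the method from Joshua's paper: the function names (`tokenize`, `focus_corpus`), the word-overlap scoring (a stand-in for whatever text mining method a researcher would really use), the parameter defaults, and the sample articles are all hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def focus_corpus(articles, seed_terms, rounds=3, profile_size=3,
                 min_overlap=1, min_len=5):
    """Hypothetical helper: return indices of articles kept after narrowing.

    articles: list of article strings; seed_terms: starting keywords.
    """
    seeds = {t.lower() for t in seed_terms}
    token_sets = [set(tokenize(a)) for a in articles]
    # Round 0: a broad keyword search gives the first, noisy corpus.
    corpus = [i for i, toks in enumerate(token_sets) if toks & seeds]
    for _ in range(rounds):
        # Describe the current corpus by its most widespread longer words
        # (document frequency, ignoring the seed terms themselves).
        doc_freq = Counter(
            w for i in corpus for w in token_sets[i]
            if len(w) >= min_len and w not in seeds
        )
        ranked = sorted(doc_freq, key=lambda w: (-doc_freq[w], w))
        profile = set(ranked[:profile_size])
        # Keep only articles that share enough vocabulary with that profile.
        focused = [i for i in corpus
                   if len(token_sets[i] & profile) >= min_overlap]
        if focused == corpus:
            break  # no article was dropped, so the corpus has stabilised
        corpus = focused
    return corpus


# Invented sample articles, for illustration only.
ARTICLES = [
    "The philosophy lecture discussed ethics and evolution at the university",
    "Professor spoke on philosophy, ethics and materialism in a public lecture",
    "His philosophy of football is simple: attack and score goals",
    "Wool prices rose at the auction this week",
    "A lecture on ethics and philosophy of evolution was given by the professor",
]

print(focus_corpus(ARTICLES, ["philosophy"]))  # [0, 1, 4]
```

In this toy run the keyword search for "philosophy" first picks up the football article as well, and the second pass drops it because it shares no vocabulary with the rest of the corpus. A real project would replace the overlap score with a stronger model and check the dropped articles by hand.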
Newspaper Navigator
Newspaper Navigator is a project that Ben Lee developed during his time as an Innovator-in-Residence at the Library of Congress. In its first stage, the project used emerging machine learning techniques to extract content such as photographs, illustrations, cartoons, and news topics from the Chronicling America newspaper scans and their corresponding OCR.
The project has successfully pulled together millions of images from 1789 to 1963 and made them searchable as a discrete set.
Newspaper Navigator — information about the project, including a link to pre-packaged datasets.
The Newspaper Navigator Dataset: Extracting And Analyzing Visual Content from 16 Million Historic Newspaper Pages in Chronicling America — research article about Newspaper Navigator.
Oceanic Exchanges
Oceanic Exchanges is an international project using newspaper data from six countries (Finland, Germany, Mexico, the Netherlands, the United Kingdom, and the United States) to examine patterns of information moving across national and linguistic boundaries.
Viral Texts
Ryan Cordell, Associate Professor of English at Northeastern University in the United States, runs the Viral Texts project, which uses data, visualisations, and text to explore how news articles, short stories, and poems spread through nineteenth-century newspapers.
Kumara Times
Former National Library Digitisation Advisor Greig Roulston used data from Papers Past in two ways: first to build a timeline of Louis Louisch’s life from articles in the Kumara Times, and then to analyse the newspaper's advertisements using animation and AdBlock.
Mining the Kumara Times for Gold, with machines (25 mins, YouTube) — Greig’s presentation on the Kumara Times at the 2017 National Digital Forum.
Examining the WWI Papers Past corpus
Programmer and artist Douglas Bagnall examined the reporting around World War I, using data from newspapers on Papers Past published between 1913 and 1922, to see if the use of particular terms could be mapped over time. The data Douglas used had been converted into JSON and contained the digitised text and a limited amount of metadata.
article: {type: “War reports in all CAPITALS” — A blog by Emerson Vandy, Services Manager for Papers Past, that provides some context and history for Douglas’ work.
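To make the idea of mapping term use over time concrete, here is a minimal Python sketch that counts how often a term appears per 1,000 words in each year. It is not Douglas' code. The structure of his JSON is not described on this page, so the `date` and `text` field names, the `term_rate_by_year` function, and the sample records are all assumptions made for illustration.

```python
import json
import re
from collections import Counter


def term_rate_by_year(records, term):
    """Return {year: occurrences of `term` per 1,000 words}.

    Each record is assumed (hypothetically) to have a "date" field in
    YYYY-MM-DD form and a "text" field holding the digitised text.
    """
    hits = Counter()
    words = Counter()
    term = term.lower()
    for rec in records:
        year = int(rec["date"][:4])
        tokens = re.findall(r"[a-z']+", rec["text"].lower())
        words[year] += len(tokens)
        hits[year] += tokens.count(term)
    # Normalise by yearly word count so busy years do not dominate.
    return {y: 1000 * hits[y] / words[y] for y in sorted(words) if words[y]}


# Invented sample data standing in for a real JSON export.
SAMPLE_JSON = """
[
  {"date": "1914-08-05", "text": "War declared. The war news reached Wellington today."},
  {"date": "1914-09-01", "text": "Troops sail for the front."},
  {"date": "1919-06-30", "text": "Peace signed at Versailles. The war is over."}
]
"""

records = json.loads(SAMPLE_JSON)
rates = term_rate_by_year(records, "war")
for year, rate in rates.items():
    print(year, round(rate, 1))
```

Dividing by the number of words published each year matters because the volume of digitised text varies from year to year; raw counts alone would mostly track how much text exists for a given year.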
QueryPic
Historian and hacker Tim Sherratt used newspaper data from both Trove and Papers Past to build QueryPic — a tool that graphs the results of keyword searches in newspapers over time.
QueryPicNZ — Tim explains how he developed QueryPicNZ using the DigitalNZ API
A tale of two islands — blog by former National Library staff member Gordon Paynter about QueryPic
The Battle Times
After gathering up a band of rogues to build a prototype at the National Digital Forum 2013 hackathon, Greig Roulston started to flesh out what a card game might look like if it used Papers Past articles (via the DigitalNZ API) to ‘roll’ the cards. Unfortunately, the project was never finished.
Cards against the Library — read about Greig's (abandoned) plans of world domination.
Other newspaper open datasets
Atlas of Digitised Newspapers and Metadata — an open-access guide to 10 newspaper databases worldwide.
Chronicling America — API and bulk data from the Chronicling America: Historic American Newspapers website.
Data Foundry — data collections from the National Library of Scotland.
Historical Newspapers open data — data from the Bibliothèque nationale du Luxembourg (National Library of Luxembourg).
Newspapers as data: A collections as data project by University of Arizona Libraries — a programme designed to introduce students to data literacy and computational analysis using digitized historical newspapers from Arizona.
Trove Bulk Download — Trove’s two sample bulk downloads of digitised data.
Get in touch
Get in touch if you know of any other examples that you think we should include or if you've created something you'd like us to showcase here.
We'd also love to hear how you've found using the data, what's gone well, what hasn't worked or what might make things easier.
Email us — paperspast@natlib.govt.nz
Related content
Dataset — Papers Past newspaper open data
Download individual newspaper titles or get all the Papers Past open data. Each year of data includes mets.xml and page-level alto.xml for each issue published that year. It does not include images.
Copyright and re-use — Papers Past newspaper open data
Information about copyright and re-use of the data in the Papers Past open dataset.
Feature image at top of page: Image created by Greig Roulston from pictures from the pilot dataset.
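The dataset description above notes that each issue comes with page-level alto.xml files. As a starting point for working with them, the Python sketch below pulls the text lines out of an ALTO document using only the standard library. In the ALTO standard, words are stored in the `CONTENT` attribute of `String` elements inside `TextLine` elements. The ALTO version used in the Papers Past files is not stated here, so the code matches element names without assuming a particular namespace. The `alto_lines` function name and the sample XML are invented for illustration.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def alto_lines(xml_text):
    """Return the text of each TextLine in an ALTO document, in file order."""
    root = ET.fromstring(xml_text)
    lines = []
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        # Words are String children; SP (space) elements carry no text.
        words = [child.get("CONTENT", "") for child in element
                 if local_name(child.tag) == "String"]
        lines.append(" ".join(words))
    return lines


# Invented sample, far smaller than a real page-level ALTO file.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
 <Layout><Page ID="P1"><PrintSpace>
  <TextBlock ID="B1">
   <TextLine><String CONTENT="KUMARA"/><SP/><String CONTENT="TIMES"/></TextLine>
   <TextLine><String CONTENT="Gold"/><SP/><String CONTENT="returns"/></TextLine>
  </TextBlock>
 </PrintSpace></Page></Layout>
</alto>"""

for line in alto_lines(SAMPLE_ALTO):
    print(line)
```

To read a downloaded file instead of the sample string, pass the file's contents to `alto_lines`. Real files also record word coordinates and hyphenation, which this sketch ignores.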