Update on 2008 Web Harvest
This project is being undertaken under the Library's legislative mandate for 'collecting, preserving, and protecting documents, particularly those relating to New Zealand, and making them accessible for all the people of New Zealand, in a manner consistent with their status as documentary heritage and taonga' (the 'legal deposit' sections of the National Library Act 2003).
As part of its ongoing web harvesting activities the National Library of New Zealand recently began a 'whole-of-domain' web harvest project to collect:
- Web pages and documents on hosts that fall under the .nz top-level domain.
- Web pages and documents on hosts that have been redirected from a host in the .nz top-level domain.
- Web pages and documents from a handpicked list of about 200 websites known to be in the New Zealand domain.
- Images, video clips, and other files that are embedded in any of the web pages identified above.
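The four collection rules above can be read as a single scope test. The following is a minimal sketch only, not the harvester's actual configuration: the function names, the parameters, and the idea of passing the handpicked list in as a set of hostnames are all assumptions made for illustration.

```python
def is_nz_host(host):
    """True if the hostname falls under the .nz top-level domain."""
    host = host.lower().rstrip(".")
    return host == "nz" or host.endswith(".nz")


def in_scope(host, redirected_from=None, embedded_in_harvested_page=False,
             handpicked=frozenset()):
    """Hypothetical scope test mirroring the four rules in the list above.

    host                        -- hostname of the URL being considered
    redirected_from             -- hostname that redirected here, if any
    embedded_in_harvested_page  -- True for images, video, etc. embedded in
                                   a page that is itself being collected
    handpicked                  -- assumed set of manually selected hostnames
    """
    if is_nz_host(host):                      # rule 1: host is under .nz
        return True
    if redirected_from is not None and is_nz_host(redirected_from):
        return True                           # rule 2: redirected from .nz
    if host.lower() in handpicked:            # rule 3: handpicked website
        return True
    return embedded_in_harvested_page         # rule 4: embedded file
```

For example, under these assumptions `in_scope("example.com", redirected_from="example.org.nz")` is true, while `in_scope("example.com")` on its own is false.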
As part of this process the Library has made a decision not to honour the robots.txt protocol.
This decision was not taken lightly as we understand and respect the purpose and practice of robots.txt. The reason for taking this decision was to enhance the likelihood of our being able to harvest as full a snapshot of the .nz domain as possible. The Library recognises that this policy can cause problems for websites and will happily change the crawler’s behaviour at the webmaster’s request. This might entail crawling more slowly, crawling while complying with robots.txt, or only crawling selected parts of the website.
The harvester is configured to discover as many hosts as possible, and to then perform partial harvests of each host. Eventually, we hope these partial harvests will add up to a full harvest. In practical terms, this means webmasters can expect the harvester to work in bursts, taking 100 URLs from each website before moving on to the next. Eventually the harvester will cycle back around to collect the next 100 URLs from the site. The exceptions to this are Government, Research, and Maori sites (.govt.nz, .ac.nz, .cri.nz and .maori.nz) where we harvest 500 URLs at a time.
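The burst behaviour described above can be sketched as a simple round-robin over per-site queues. This is an illustration under stated assumptions, not the crawler's real scheduler: the names `burst_size` and `schedule_bursts` and the in-memory queues are hypothetical; only the 100 and 500 URL figures and the four domain suffixes come from the description above.

```python
from collections import deque

DEFAULT_BURST = 100    # URLs taken from an ordinary site per visit
EXTENDED_BURST = 500   # URLs taken from government, research and Maori sites
EXTENDED_SUFFIXES = (".govt.nz", ".ac.nz", ".cri.nz", ".maori.nz")


def burst_size(host):
    """Return how many URLs to take from this host in one visit."""
    host = host.lower().rstrip(".")
    return EXTENDED_BURST if host.endswith(EXTENDED_SUFFIXES) else DEFAULT_BURST


def schedule_bursts(pending):
    """Yield (host, urls) bursts, cycling through hosts until all are done.

    `pending` maps each hostname to the list of URLs still to be collected
    (an assumed in-memory stand-in for the crawler's real frontier).
    """
    rotation = deque((host, deque(urls)) for host, urls in pending.items())
    while rotation:
        host, queue = rotation.popleft()
        count = min(burst_size(host), len(queue))
        burst = [queue.popleft() for _ in range(count)]
        if burst:
            yield host, burst
        if queue:
            # URLs remain: the host rejoins the back of the rotation.
            rotation.append((host, queue))
```

With 250 URLs pending on a hypothetical `example.co.nz` and 600 on `example.govt.nz`, this sketch would visit the first site three times (100, 100, then 50 URLs) and the second twice (500, then 100), alternating between them.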
This means we will generally only spend a short amount of time on each site before moving on. Some difficult cases can still cause problems, though. It is possible for the crawler to get stuck (for example, in a crawler trap, or even just in a wiki or calendar), though we monitor for these occurrences and fix them as quickly as we can. Another potential problem is that websites that are distributed across several hosts might get crawled several times at once, resulting in heavy traffic to the web server.
When these problems occur (or others like them) we endeavour to fix them as quickly as possible. The Library has contracted a third party to undertake this project on our behalf (the Internet Archive in the United States, developers of the well-known Wayback Machine).
Due to time zone differences we may not be able to respond immediately to requests to adjust our crawling parameters, but please be assured that we will respond to all such requests as soon as we are able to.
More about the New Zealand Web Harvest 2008
Answers to questions about the Web Harvest (20 October 2008)
