The National Library of New Zealand has a social responsibility to preserve New Zealand's social and cultural history, be it in the form of books, newspapers and photographs, or of websites, blogs and YouTube videos. An increasing amount of New Zealand's documentary heritage is only available online. New Zealanders find this content valuable and convenient, but its impermanence, lack of clear ownership, and dynamic nature pose significant challenges to our efforts to collect and preserve it.
The public benefit from the safe, long-term preservation of New Zealand's online heritage is incalculable. Our online social history and much government and institutional history will be preserved for researchers, historians, and ordinary New Zealanders. We will be able to look back on internet documents as we do the printed words left to us by previous generations.
The National Library of New Zealand conducted a national web harvest between 12 May and 5 June 2010. This page is the primary source of information about the harvest.
Current status of the web harvest
The harvest is now complete.
The harvest ran for approximately 24 days and requested 170 million URLs. The downloaded data and log files together total 6.1 terabytes.
The data and log files are currently being transported to the National Library in Wellington, where they will be securely stored and analysed.
Why does the National Library collect websites?
The National Library exists to preserve New Zealand's social and cultural history, whether in the form of books, newspapers and photographs, or websites, blogs and videos.
The New Zealand Web Harvest 2010 recognises the importance of the internet in all areas of New Zealand society and culture by taking a 'snapshot' of the New Zealand internet in May 2010.
What is a whole of domain web harvest?
The National Library undertakes two streams of web archiving: selective harvesting and domain harvesting.
Selective harvesting is where Library staff select websites for inclusion in our collections. The Library has been selectively harvesting for several years.
Domain harvesting is an attempt to harvest as much material as is technically possible with a minimum of human intervention. It is called "domain harvesting" because the simplest approach is to try to harvest an internet domain, such as the .nz domain for New Zealand.
The National Librarian is authorised to harvest websites by the National Library of New Zealand (Te Puna Mātauranga o Aotearoa) Act 2003 and the Minister’s National Library Requirement (Electronic Documents) Notice 2006.
About the 2010 whole of domain web harvest
The National Library commissioned the Internet Archive (a not-for-profit organisation based in the United States) to perform the harvest on our behalf.
We attempted to acquire:
- websites falling under the .nz country code,
- websites falling under .com, .net and .org that can be programmatically determined to be hosted on machines that are physically located in New Zealand, and
- selected websites based overseas that are covered by the provisions of the National Library of New Zealand Act (2003).
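As a rough illustration only, the three scope rules above could be expressed as a simple check like the following. This is a minimal sketch, not the harvester's actual configuration, which is not described on this page. The function name `in_scope`, the `hosted_in_nz` flag (standing in for whatever programmatic location lookup was used) and the `selected_hosts` set are all hypothetical.

```python
from urllib.parse import urlsplit

# Generic top-level domains named in the scope list above.
GENERIC_TLDS = {"com", "net", "org"}


def in_scope(url, hosted_in_nz=False, selected_hosts=frozenset()):
    """Return the name of the rule that puts `url` in scope, or None.

    hosted_in_nz   -- assumed result of a separate hosting-location lookup
                      (hypothetical; the lookup itself is not shown here)
    selected_hosts -- hypothetical set of hand-selected overseas hostnames
    """
    host = (urlsplit(url).hostname or "").lower().rstrip(".")
    if not host:
        return None  # no hostname could be parsed from the URL

    tld = host.rsplit(".", 1)[-1]
    if tld == "nz":
        return "nz-domain"      # rule 1: under the .nz country code
    if tld in GENERIC_TLDS and hosted_in_nz:
        return "hosted-in-nz"   # rule 2: .com/.net/.org hosted in New Zealand
    if host in selected_hosts:
        return "selected"       # rule 3: selected overseas website
    return None


print(in_scope("http://www.natlib.govt.nz/about"))
print(in_scope("https://example.com/", hosted_in_nz=True))
print(in_scope("https://example.com/"))
```

The rules are checked in the order they are listed, so a .nz address is always reported under the first rule even if it also appears in the selected list.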
The harvest collected publicly viewable web content. If your website, or parts of it, is password protected, this content will not have been harvested. The web harvester generally honoured the robots.txt convention, with some exceptions.
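For website owners unfamiliar with the robots.txt convention, the sketch below shows how a crawler that honours it would decide whether to request a page. It uses Python's standard `urllib.robotparser` module. The robots.txt rules, the user agent name `example-harvester` and the URLs are made up for illustration, and the sketch does not model the exceptions mentioned above, which are not listed on this page.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt that asks all crawlers to stay out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed(robots_txt, user_agent, url):
    """Return True if the robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# "example-harvester" is a hypothetical user agent name.
print(allowed(ROBOTS_TXT, "example-harvester", "https://example.org.nz/index.html"))
print(allowed(ROBOTS_TXT, "example-harvester", "https://example.org.nz/private/page.html"))
```

With these rules, a crawler honouring the convention would request the first URL and skip the second.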
Keeping you informed
Notice of the harvest was first published on Thursday 8 April 2010, and the harvest began five weeks later on 12 May 2010.
The Library kept website owners and other affected parties up to date throughout the harvest via this web page.
Regular progress updates were posted on our LibraryTechNZ blog and Twitter (NLNZwebharvest).
See archived LibraryTechNZ posts about the web harvest
Consultation prior to the 2010 web harvest
The Library conducted its first whole of domain harvest in October 2008. Several issues were raised at this time, including:
- Notification: The harvest was initiated without prior notification to affected parties.
- Robots policy: The harvester was configured to ignore the robots.txt convention unless the website owner contacted the Library to request that it be honoured.
- Location of the harvester: The harvest was operated by the Internet Archive from the United States, and some website owners are charged more for international traffic.
In January 2010 the Library released a consultation document, the Web Harvest Options Paper, for technical and network stakeholders. It provided an overview of the 2008 web harvest, outlined options, and sought feedback to inform the 2010 web harvest.
Consultation closed on 8 February 2010. The results of the feedback were published on 8 April 2010.
The 2008 Harvest
In 2008 the National Library undertook a web harvest of the entire .nz domain and a list of approximately 500 websites outside the .nz domain that held New Zealand content (for example, under .com and .net).
The domain harvest finished slightly ahead of schedule on Thursday 23 October 2008. The harvester collected 105 million URLs, about 10 million every day. It harvested about 4.1 terabytes of data, which compresses down to slightly less than 3 terabytes.
Our approach to the harvest caused some problems for site owners. While it was absolutely not our intention to disrupt their services, we missed some important issues before the harvest launched. We have worked closely with the web community since the issues were flagged, and were able to launch the 2010 harvest successfully.
Web Harvest update (15 October 2008)
Answers to questions about the Web Harvest (20 October 2008, updated 21 & 29 October)
Contact details
Please contact us if you have any further comments or questions.


