New Zealand Web Harvest FAQs
Published 20 October 2008; updated 21 October and 29 October 2008, and February 2009 (updates noted in text)
General
Why are you harvesting New Zealand websites?
The National Library of New Zealand has a social responsibility to preserve New Zealand's social and cultural history, be it in the form of books, newspapers and photographs, or of websites, blogs and YouTube videos.
The public benefit from the safe, long-term preservation of New Zealand's online heritage is incalculable. Our online social history and much government and institutional history will be preserved for researchers, historians, and ordinary New Zealanders. We will be able to look back on internet documents as we do the printed words left to us by previous generations.
How is the harvest progressing? How much have you downloaded? (Updated 29 October 2008)
The domain harvest finished slightly ahead of schedule on Thursday 23 October 2008. The harvester collected 105 million URLs, about 10 million every day. It harvested about 4.1 terabytes of data, which compresses down to slightly less than 3 terabytes (a terabyte is 1,024 gigabytes, or 1.05 million megabytes).
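As a rough check of the figures above, the following sketch redoes the arithmetic with the binary units used in this answer (1 terabyte = 1,024 gigabytes). The function and constant names are illustrative only, and the average size per URL is derived from the published totals rather than taken from the harvest statistics.

```python
# Figures quoted in the answer above.
URLS_HARVESTED = 105_000_000
URLS_PER_DAY = 10_000_000        # "about 10 million every day"
RAW_TERABYTES = 4.1
COMPRESSED_TERABYTES = 3.0       # upper bound: "slightly less than 3 terabytes"

# Binary units, as used in the answer (1 TB = 1,024 GB).
GB_PER_TB = 1024
MB_PER_GB = 1024
KB_PER_MB = 1024


def terabytes_to_megabytes(terabytes: float) -> float:
    """Convert terabytes to megabytes using binary multiples."""
    return terabytes * GB_PER_TB * MB_PER_GB


def average_kb_per_url(terabytes: float, urls: int) -> float:
    """Average uncompressed size of one harvested URL, in kilobytes."""
    return terabytes_to_megabytes(terabytes) * KB_PER_MB / urls


if __name__ == "__main__":
    print(f"1 TB = {terabytes_to_megabytes(1):,.0f} MB")
    print(f"Average per URL: {average_kb_per_url(RAW_TERABYTES, URLS_HARVESTED):.1f} KB")
    print(f"Compressed size is at most {COMPRESSED_TERABYTES / RAW_TERABYTES:.0%} of the raw size")
    print(f"Crawl time at the quoted rate: {URLS_HARVESTED / URLS_PER_DAY:.1f} days")
```

On these figures, one terabyte is 1,048,576 megabytes (the "1.05 million" quoted above), and the average harvested URL is roughly 42 kilobytes before compression.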
During the week of 28 October 2008 we plan one more short harvest to capture a set of websites whose owners emailed us to ask to be included in the harvest. We do not expect this to have much effect on the overall size of the harvest.
Where will the harvest be kept? Will it be open to the public? (Added 29 October 2008)
The harvested data will be transferred to the National Library of New Zealand by the end of 2008, where it will be securely stored. We then hope to be able to release more detailed statistics.
We eventually hope to provide public access, but there are a lot of issues to resolve first, and we will need time to study the data and consider the best way to make it available.
What type of web harvesting does the Library do?
The National Library undertakes two streams of web archiving: selective harvesting and domain harvesting.
Selective archiving is where curators select high-value websites for inclusion in our collections, and then harvest them using the Web Curator Tool. The Library has been selectively harvesting for several years. Since we started harvesting with the Web Curator Tool in January 2007, we have run about 2,500 selective harvests. The current selective harvesting programme focuses on the upcoming General Election.
The New Zealand Web Harvest 2008 is a domain harvest. Domain harvesting is an attempt to harvest as much material as is technically possible with a minimum of human intervention. It is called "domain harvesting" because the simplest approach is to try to harvest an internet domain, such as the nz (or ".nz") domain for New Zealand.
Will you repeat the harvest? How often? (Added 21 October 2008)
While we have not planned any further harvests at this time, it is likely that domain harvests will become a feature of the Library’s overall web harvesting programme. Analysis of the current harvest and research into various access issues will help determine frequency.
As one lesson learned from the current process, we will endeavour to communicate our plans through appropriate fora before future harvests.
Will we be able to view pages captured by the web harvests? (Added 19 February 2009)
Yes, but probably not for several months. We're still analysing the data that we have captured, and there are several issues related to public access that we have to resolve (privacy, take-down policy, evidential value, etc) before we make the web harvest available. There are also some technical issues.
This might take some time, but we are working on it, and we are very keen to make the 2008 web harvest, and all our other (selective) harvests, publicly accessible.
Crawl scope
Exactly what websites are you harvesting?
The scope of the harvest is dictated by our legal deposit legislation, as described on the New Zealand Web Harvest 2008 web page.
Where did you get the list of host names to harvest?
We got the list of hosts from several sources:
1. The crawl engineers made a list of all *.nz hosts that had previously been encountered as part of the Internet Archive's normal harvesting activities. We will also harvest any other *.nz hosts encountered during the harvest (all *.nz hosts are within the scope of legal deposit).
2. The crawl engineers used several available services (e.g. the Alexa web search API) to look up the names of hosts that are physically in New Zealand but not registered in the nz domain.
3. We are also harvesting hosts that we have previously harvested with the Web Curator Tool as part of our ongoing selective web archiving programme.
4. The crawl engineers ran a test crawl in early October and made a list of all the hosts that were the targets of redirects from hosts in New Zealand (e.g. yahoo.co.nz redirects to nz.yahoo.com). This list was too broad to use in its entirety but we quickly went through and hand-picked a further set of non-nz hosts that are within the scope of the harvest parameters.
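The scope rule implied by these sources (every *.nz host is in scope, and a non-nz host is in scope only if it is on the hand-vetted list) can be sketched as follows. This is an illustration only, not the configuration used for the crawl: the function name and the contents of the example list are hypothetical, and we are not stating which non-nz hosts were actually selected.

```python
from __future__ import annotations


def host_in_scope(host: str, vetted_non_nz_hosts: frozenset[str]) -> bool:
    """Return True if a host falls within the harvest scope sketched above.

    Any host in the nz domain is in scope. Any other host is in scope
    only if it appears on the hand-vetted list.
    """
    # Host names are case-insensitive; a trailing dot is legal in DNS.
    normalised = host.strip().lower().rstrip(".")
    if normalised == "nz" or normalised.endswith(".nz"):
        return True
    return normalised in vetted_non_nz_hosts


# Hypothetical vetted list, for illustration only.
EXAMPLE_VETTED_HOSTS = frozenset({"nz.yahoo.com"})

if __name__ == "__main__":
    for candidate in ("yahoo.co.nz", "nz.yahoo.com", "www.example.com"):
        print(candidate, host_in_scope(candidate, EXAMPLE_VETTED_HOSTS))
```

Note that the suffix test looks for ".nz" with the leading dot, so a host such as "example.nz.com" is not treated as part of the nz domain.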
How can I get my website into the harvest? (Updated 29 October 2008)
The harvest is now complete, and it is too late to get your website into the harvest.
If you have already submitted a website, it will be harvested between 24 October and 28 October 2008.
I run several large New Zealand websites in .co.nz containing literally tens of millions of pages plus unknown quantities of dynamic pages. Thousands of pages change daily and total content is several hundred gigabytes, and a lot of that is video and imagery. Do you intend to download all of the content from all of my sites?
In principle, yes.
In practice the internet is effectively infinite, because dynamic pages can generate an unlimited number of URLs, and it is impossible to harvest everything. We will therefore have to stop harvesting at some point. Our initial target for this harvest is 100 million URLs (we may extend this to 150 million). This number has been chosen based on the experience gathered from similar Australian harvests.
The Australian web domain harvests: a preliminary quantitative analysis of the archive data [PDF]
Another factor to consider is that we want this crawl to be as broad as possible, and to capture as many hosts in the nz domain as possible. As we will harvest every host at a similar speed, it is very likely that small hosts will be completely harvested, but that large hosts will only be partially harvested.
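The effect described above can be illustrated with a small simulation. It is a simplified sketch, not a description of how Heritrix schedules its work: it assumes the harvester visits each host in turn, takes a fixed batch of URLs per visit, and stops when an overall URL budget is spent. The host names, sizes, budget and function name are all made up for the example.

```python
from __future__ import annotations


def simulate_broad_crawl(
    host_sizes: dict[str, int],
    url_budget: int,
    batch_size: int = 100,
) -> dict[str, int]:
    """Round-robin over hosts, taking up to batch_size URLs per visit,
    until every host is exhausted or the overall budget is spent."""
    harvested = {host: 0 for host in host_sizes}
    remaining = url_budget
    while remaining > 0:
        progressed = False
        for host, size in host_sizes.items():
            if remaining == 0:
                break
            take = min(batch_size, size - harvested[host], remaining)
            if take > 0:
                harvested[host] += take
                remaining -= take
                progressed = True
        if not progressed:
            break  # every host has been completely harvested
    return harvested


# Hypothetical hosts and sizes, for illustration only.
EXAMPLE_HOSTS = {
    "small.example.nz": 250,
    "medium.example.nz": 1_200,
    "huge.example.nz": 10_000_000,
}

if __name__ == "__main__":
    result = simulate_broad_crawl(EXAMPLE_HOSTS, url_budget=5_000)
    for host, count in result.items():
        print(f"{host}: {count:,} of {EXAMPLE_HOSTS[host]:,} URLs")
```

With a budget of 5,000 URLs, the two smaller hosts in this example are captured in full (250 and 1,200 URLs), while the largest host receives the remaining 3,550 URLs, a tiny fraction of its content.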
Why are you harvesting my site when it has nothing to do with New Zealand?
The National Library of New Zealand (Te Puna Mātauranga o Aotearoa) Act 2003 sets out what is (and is not) in scope for legal deposit.
In simple terms, we are mandated to make a copy of internet publications that are published by New Zealanders and/or in New Zealand. This includes any internet publications produced in New Zealand or commissioned to be produced outside New Zealand by a person resident in New Zealand or whose principal place of business is in New Zealand. It includes some material you may not expect (such as sites hosted by New Zealand companies for their overseas clients) and excludes material that we might like to harvest (such as material hosted overseas and with a New Zealand focus).
Why have you not collected all the New Zealand content that is outside the nz domain?
Our list of non-nz hosts (which we either have explicit permissions for or which have been commissioned from within New Zealand) only represents a small part of the non-nz hosts out there. We would like to harvest more, but it is very difficult to detect this content reliably in a large-scale, automated fashion. Our selection is therefore hand-vetted.
The National Library of the Czech Republic is working on automating this process using whois lookups, and looking for Czech phone numbers, names, language and email addresses on web pages, but this technology is in its infancy and could not be used for this harvest.
Are you harvesting images posted to Flickr, movies posted to YouTube, and other similar content that has been uploaded by New Zealanders?
Not as part of this crawl. For the New Zealand Web Harvest 2008 we are harvesting on a host-by-host basis only. We're not taking parts of websites or selecting content that is posted by New Zealanders on third-party websites. However, much of this content may be in the scope of the legal deposit legislation, and we may try to harvest it as part of other projects.
Technology
What web crawler is being used to perform the New Zealand Web Harvest 2008?
The Internet Archive's Heritrix web harvester (version 1.14) is being used for this crawl.
Are you using the Web Curator Tool?
The Web Curator Tool is a web harvesting tool developed jointly by the National Library and the British Library.
The National Library's selective web archiving programme uses the Web Curator Tool. Selective archiving is where curators select high-value websites for inclusion in our collections, harvest them, and quality-review them to ensure the website is captured completely.
Domain harvesting is an entirely different challenge, and the Web Curator Tool is not suitable. Instead, we have commissioned the Internet Archive to perform the harvest using Heritrix.
Why am I seeing the Web Curator Tool in my web logs?
Our selective harvesting programme is still in operation, and some website owners may notice requests from the Web Curator Tool software in their web logs. The Web Curator Tool will generally behave in a similar fashion and work at a similar pace to the domain harvester, but will crawl a site in one session, whereas the domain harvester will harvest 100 or 500 URLs at a time. Both harvesters ignore robots.txt.
Does the harvester use cookies?
Yes, the harvester collects cookies as it goes.
Why are you harvesting from the USA and not from New Zealand? (Added 21 October 2008, updated 29 October 2008)
We have contracted the Internet Archive to conduct the harvest because they are the single most experienced provider of large-scale crawling services in the world.
The Internet Archive is a non-profit organisation founded to build an Internet library, with the purpose of offering permanent access for researchers, historians, and scholars to historical collections that exist in digital format. We hope that, after observing the experts at work, we will be able to manage future harvests from within New Zealand. We will also be engaging with key stakeholders to discuss future harvests.
Robots policy
Why are you ignoring robots.txt?
We have addressed this point directly in our 15 October update.
Why did you not notify all the webmasters before the harvest? (Added 21 October 2008)
We could not see a good way to do so without effectively becoming spammers. In hindsight we could have communicated better with webmasters. When we decide to run the harvest again, we will make more of an effort to publicise the harvest in mailing lists and groups frequented by webmasters.
If you ignore robots.txt, won't the crawler get stuck in spider traps?
The crawler is likely to get stuck in traps whether we ignore robots.txt or not. This is one reason that we have engaged the Internet Archive to manage and monitor this crawl, as they have a lot of experience avoiding crawler traps.
The main purpose of using robots.txt is to stop crawlers going in loops using dynamic URLs or getting to areas where human interaction is a must. Why ignore it?
The problem is that many well-designed websites use robots.txt to guide crawlers away from traps, but because robots.txt is not a standard, other websites use robots.txt in different ways. For example, they may throw up a blanket denial or use the protocol to restrict access to specific URLs that we want to harvest, such as the CSS or image files necessary to reproduce a page. We know from our selective harvesting experience that if you honour robots.txt, you will get a very poor result for many sites (or no result at all).
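The two problem cases mentioned above can be demonstrated with the robots.txt parser in Python's standard library (urllib.robotparser). This is only an illustration of what a crawler that honours robots.txt would be allowed to fetch. The robots.txt contents, URLs and user-agent name are invented for the example and are not taken from any real site or from the harvester.

```python
from __future__ import annotations

from urllib.robotparser import RobotFileParser


def allowed_urls(robots_txt: str, user_agent: str, urls: list[str]) -> list[str]:
    """Return the URLs a robots.txt-honouring crawler may fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


# Hypothetical page and the files needed to reproduce it.
EXAMPLE_URLS = [
    "http://www.example.nz/index.html",
    "http://www.example.nz/css/site.css",
    "http://www.example.nz/images/logo.png",
]

# Case 1: a blanket denial.
BLANKET_DENIAL = "User-agent: *\nDisallow: /\n"

# Case 2: the page is allowed, but its stylesheet and images are not.
ASSETS_BLOCKED = "User-agent: *\nDisallow: /css/\nDisallow: /images/\n"

if __name__ == "__main__":
    agent = "example-harvester"  # hypothetical user-agent name
    print(allowed_urls(BLANKET_DENIAL, agent, EXAMPLE_URLS))
    print(allowed_urls(ASSETS_BLOCKED, agent, EXAMPLE_URLS))
```

In the first case nothing can be harvested at all. In the second, only the HTML page is captured, so the archived copy would be missing the stylesheet and images needed to display it as originally published.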
Is everything posted really meant to be public? Doesn't the use of robots.txt indicate a desire for privacy?
In some cases collecting web pages is similar to collecting people's diaries (which national libraries also do). One of the interesting ethical issues that web harvesting throws up is that while internet documents are legally "publications", the people posting them might think of internet content as "communication", and expect some level of privacy.
At the moment, the web archiving community doesn't have good answers to these issues. What we do know, however, is that a lot of the web is at risk of simply disappearing, and the longer we wait before harvesting, the more likely it is that important documents will be lost forever.
If you ignore robots.txt, what's to stop me blocking your crawler's IP address?
Nothing.
Some webmasters have taken this action, and we're sorry they felt they had to go to these lengths. We are running this harvest with good intentions, and ask that if you have blocked us, you reconsider – for example by allowing the harvester to access your site on the condition that it honours robots.txt. We'd much prefer this outcome to getting nothing from your websites at all.
Please remember that this project is about trying to ensure that as much as possible of the social history being enacted on the web today is available to researchers and all New Zealanders in the future. If we don't capture it now, we may not have the chance later.
