In the new era of large amounts of publicly available data, an issue that is sometimes overlooked is ethical data collection. For experimental studies involving humans we have clear guidelines and an organizational process for assessing and approving data collection (in the US, via the IRB), but for observational data the situation is much more ambiguous. For instance, if I want to collect data on 50,000 book titles on Amazon, including their ratings, reviews, and cover images, is it ethical to collect this information by web crawling? A first thought might be “why not? The information is there and I am not taking anything from anyone”. However, there are hidden costs and risks that must be considered. First, in the above example, the web crawler mimics manual browsing and therefore places load on Amazon’s servers. This is one cost to Amazon. Second, Amazon posts this information for buyers, for the purpose of generating revenue. When the intention is not to purchase, collecting it is a misuse of the public information. Finally, one must ask whether there is any risk to the data provider: for instance, overly heavy access can slow down the provider’s server, thereby slowing down or even denying access to actual potential buyers.
When the goal of the data collection is research, another factor to weigh against these costs and risks is the benefit of the research study to society, to scientific research or “general knowledge”, and perhaps even to the company.
Good practice involves weighing the costs, risks, and benefits to the data provider, designing your collection accordingly, and letting the data provider know about your intention. Because every page requested is a cost to the provider, careful consideration of the sample size you actually need is still important even in this new environment. An interesting paper by Allen, Burk, and Davis (“Academic Data Collection in Electronic Environments: Defining Acceptable Use of Internet Resources”) discusses these issues and offers guidelines for “acceptable use” of internet resources.
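To make this concrete, here is a minimal Python sketch of what a considerate collection design might look like. It is only an illustration under stated assumptions, not a definitive implementation: the bot name, contact address, the 5-second default delay, and the function names `plan_crawl`, `http_get`, and `fetch_politely` are all hypothetical. It uses the standard library's `urllib.robotparser` to respect the provider's robots.txt rules, caps the number of pages at the sample size you decided on, identifies the crawler in the User-Agent header, and pauses between requests.

```python
import time
import urllib.request
from urllib.robotparser import RobotFileParser

# Hypothetical identity: tells the provider who is crawling and how to reach you.
USER_AGENT = "ExampleResearchBot/0.1 (contact: researcher@example.edu)"


def plan_crawl(urls, robots_lines, max_pages, default_delay=5.0,
               user_agent=USER_AGENT):
    """Return (urls_to_fetch, delay_seconds) for a considerate crawl.

    robots_lines: the lines of the provider's robots.txt file.
    max_pages: the sample size actually needed for the study.
    default_delay: assumed pause when robots.txt gives no Crawl-delay.
    """
    rules = RobotFileParser()
    rules.parse(robots_lines)
    # Keep only pages the provider allows crawlers to access.
    allowed = [u for u in urls if rules.can_fetch(user_agent, u)]
    delay = rules.crawl_delay(user_agent)
    if delay is None:
        delay = default_delay
    # Never request more pages than the planned sample size.
    return allowed[:max_pages], float(delay)


def http_get(url, user_agent=USER_AGENT):
    """Fetch one page, identifying the crawler in the User-Agent header."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read()


def fetch_politely(urls, delay, fetch=http_get, sleep=time.sleep):
    """Fetch pages one at a time, pausing `delay` seconds between requests."""
    pages = {}
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay)  # spread the load on the provider's server
        pages[url] = fetch(url)
    return pages
```

In practice the robots.txt lines would come from the provider's own site (for example via `RobotFileParser.set_url` followed by `read`), and none of this replaces letting the data provider know about your intention.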
These days more and more companies (e.g., eBay and Amazon) are moving to “push” technology, where they make their data available for collection via API and RSS technologies. Obtaining data in this way largely avoids the ethical and legal concerns raised by crawling, because the provider has explicitly chosen to offer the data, but one is then limited to the data that the data source has chosen to provide. Moreover, the amount of data that can be retrieved is usually limited. Hence, I believe that web crawling will continue to be used, but in combination with API and RSS the extent of crawling can be reduced.
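As a rough illustration of how such a combination might reduce crawling, the sketch below reads item links and titles from an RSS 2.0 feed using Python's standard `xml.etree.ElementTree`, and then keeps only the wanted pages that the feed did not cover as candidates for crawling. The function names `items_from_rss` and `urls_to_crawl` are hypothetical, and the sketch assumes a plain RSS 2.0 layout with `item`, `title`, and `link` elements; a real provider's feed or API may be structured differently.

```python
import xml.etree.ElementTree as ET


def items_from_rss(xml_text):
    """Map each item's link to its title, for an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    items = {}
    for item in root.iter("item"):
        link = item.findtext("link")
        if link:
            title = item.findtext("title") or ""
            items[link.strip()] = title.strip()
    return items


def urls_to_crawl(wanted_urls, feed_items):
    """Return the wanted pages the feed did not provide, in original order."""
    return [url for url in wanted_urls if url not in feed_items]
```

Only the URLs returned by `urls_to_crawl` would then need to be visited by a crawler; everything else comes from data the provider has chosen to publish.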