Online data are a huge resource for research as well as for practice. Although it is often tempting to “scrape everything” using technologies like web crawling, it is extremely important to keep the goal of the analysis in mind. Are you trying to build a predictive model or a descriptive one? How will the model be used? Will it be deployed to new records?
What do you need to do before online data collection?
Data collection means gathering useful intelligence to support decisions such as setting a product price. A vast and constantly updated body of information is now available online through websites, directories, B2B/B2C platforms, e-books, e-newspapers, yellow pages, official data, and accessible databases, and this encourages more people to collect data from the Internet. Before data mining, you still need to be well prepared; as the ancient Chinese saying goes, “Preparedness ensures success, unpreparedness spells failure.”
- Why do you want to collect intelligence, and what is your objective? What will you do with this intelligence after collection? Writing a description of your project helps the data mining team understand your aim. For example, an objective could be: “I want to collect enough intelligence to determine a competitive price for my product.”
- What type of information do you need to collect to support your final analysis or decision? For example, if you want to collect the prices of similar products, you also need the product specifications so that you compare like with like. External factors such as coupons, gifts, or tax also need to be considered for accuracy.
- Where? Whether you search generally by keyword or gather data from specific resources or databases depends on the nature of the project. E-commerce websites, for example, are a good source for prices and product specifications.
- Who? Will you collect the data using your own resources or outside resources? Outsourcing online research work to lower-wage countries with good Internet access and a large English-educated workforce, such as China, is one option for cutting costs. Whoever does the work needs training and the necessary resources.
- How? Always keep the purpose of collecting the data in mind so that you can improve the collection process. The methodology and process need to be defined to ensure accurate and reliable data. Decisions made on wrong data can result in serious problems.
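To make the point about coupons, gifts, and tax concrete, here is a minimal Python sketch of comparing prices only between offers with the same specification. Everything in it is an assumption for illustration: the `Offer` fields, the helper names `effective_price` and `comparable_prices`, and the order of adjustments (coupon subtracted first, tax applied to the discounted price, estimated gift value subtracted last). Real tax and coupon rules vary by seller and region, so treat it as a starting point rather than a definitive method.

```python
from dataclasses import dataclass


@dataclass
class Offer:
    """One collected price record (hypothetical fields for illustration)."""
    seller: str
    spec: str                 # product specification, used to match like with like
    list_price: float         # advertised price before any adjustment
    coupon: float = 0.0       # flat discount amount
    gift_value: float = 0.0   # estimated value of a bundled gift
    tax_rate: float = 0.0     # e.g. 0.10 for 10% tax


def effective_price(offer: Offer) -> float:
    """Assumed adjustment order: coupon, then tax, then gift value."""
    discounted = offer.list_price - offer.coupon
    taxed = discounted * (1 + offer.tax_rate)
    return round(taxed - offer.gift_value, 2)


def comparable_prices(offers, spec):
    """Return (effective price, seller) pairs for one spec, cheapest first."""
    return sorted(
        (effective_price(o), o.seller) for o in offers if o.spec == spec
    )


if __name__ == "__main__":
    # Made-up sample records, not real market data.
    offers = [
        Offer("A", "16GB/512GB", 1000.0, coupon=50.0, tax_rate=0.10),
        Offer("B", "16GB/512GB", 980.0, gift_value=20.0, tax_rate=0.10),
        Offer("C", "8GB/256GB", 700.0),
    ]
    for price, seller in comparable_prices(offers, "16GB/512GB"):
        print(seller, price)
```

In this made-up sample, seller B has the lower list price, but seller A becomes cheaper once the coupon, gift, and tax are taken into account, which is why those factors are worth collecting alongside the price itself.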
I’ve summarized these 5 tips from my own and my clients’ experience, and I hope they provide some insight for you. If you have any opinions or experience in online data gathering or outsourcing, please share them with us or contact me directly.