Monday, February 22, 2010

Online data collection

Online data are a huge resource for research as well as for practice. Although it is often tempting to "scrape everything" using technologies like web-crawling, it is extremely important to keep the goal of the analysis in mind. Are you trying to build a predictive model? A descriptive model? How will the model be used? Will it be deployed to new records?

Dean Tau from Co-soft recently posted an interesting and useful comment in the Linked-in group Data Mining, Statistics, and Data Visualization. With his permission, I am reproducing his post:

What you need to do before online data collection?

Data collection is collecting useful intelligence for making decisions such as product price determination. Nowadays, available on websites, directories, B2B/B2C platforms, e-books, e-newspaper, yellow page, official data, accessible databases, vast and updated information online encourages more people to collect data from the Internet. Before data mining, you still need to be well prepared, as the ancient Chinese saying “Preparedness ensures success, unpreparedness spells failure.”

  1. Why do you want to collect intelligence or what's your objective? What will you do with this intelligence after collection? Making a description of your project can help the data mining team have a better understanding of your aim. Taking an example, an objective can be I want to collect enough intelligence to determine a competitive price for my product.
  2. What type of information you need to collect to support your final analysis / decision? Such as, if you want to collect the prices of similar product, product specification are necessary to collect for comparison of the same one. The external factors like coupon, gifts or tax also need to be considered for accuracy.
  3. Where? General searching using keywords or gathering data from specific resources or database depends on project nature. The information from e-commerce websites would be a great avenue for price gathering and product specification.
  4. Who? Will you collect the data by using the resources of your own or outside resources? Outsourcing of online research work to lower wages countries with the accessibility of internet capabilities and vast English educated personnel like China would be an option for cutting cost. The people who are going to do the work need training and necessary resources on that.
  5. How? Always remember your purpose of collecting data to improve the collection process. The methodology and process need to be defined to ensure accurate and reliable data. Decisions making on wrong data would result in serious problems.
I've summarized 5 tips from myself and my clients' experiences, hopefully to provide some insights for you. If you have any opinion or experience in online data gathering or outsourcing, please share with us or contact me directly.

Thursday, February 11, 2010

Over-fitting analogies

To explain the danger of model over-fitting in prediction to data mining newcomers, I often use the following analogy:
Say you are at the tailor's, who will be sewing an expensive suit (or dress) for you. The tailor takes your measurements and asks whether you'd like the suit to fit you exactly, or whether there should be some "wiggle room". What would you choose?
The answer is, "it depends how you plan to use the suit". If you are getting married in a few days, then probably a close fit is desirable. In contrast, if you plan to wear the suit to work throughout the next few years, you'd most likely want some "wiggle room"... The latter case is similar to prediction, where you want to make sure to accommodate new records (your body's measurements during the next few years) that are not exactly identical to the current data. Hence, you want to avoid over-fitting. The wedding scenario is similar to models built for causal explanation, where you do want the model to fit the data well (back to the explanation vs. prediction distinction).

I just found some nice terminology, by Bruce Ratner (GenIQ.net), explaining the idea of over-fitting:
A model is built to represent a training data... not to reproduce a training data. [Otherwise], a visitor from the validation data will not feel at home. The visitor encounters an uncomfortable fit in the model because s/he probabilistically does not look like a typical data-point from the training data. Thus, the misfit visitor takes a poor prediction
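
The difference between representing and reproducing the training data can be illustrated with a small simulation. This is only a sketch under stated assumptions: the data are synthetic (a noisy straight line), and the helper names (`fit_line`, `fit_memorizer`, `mse`) are made up for this illustration. One model memorizes the training set by predicting the outcome of the nearest training record; the other fits a straight line and so leaves some "wiggle room".

```python
import random

def fit_line(xs, ys):
    """Ordinary least-squares straight line (the model with 'wiggle room')."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x

def fit_memorizer(xs, ys):
    """Reproduces the training data: predicts the y of the nearest training x."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

def mse(model, xs, ys):
    """Mean squared prediction error of a model on a set of records."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def make_data(rng, n):
    # Assumed data-generating process: y = 2x + noise (purely synthetic).
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [2.0 * x + rng.gauss(0, 1.0) for x in xs]
    return xs, ys

rng = random.Random(0)
train_x, train_y = make_data(rng, 200)   # data used to build the models
valid_x, valid_y = make_data(rng, 200)   # the "visitors" from the validation data

line = fit_line(train_x, train_y)
memorizer = fit_memorizer(train_x, train_y)

for name, model in [("line", line), ("memorizer", memorizer)]:
    print(name,
          "training MSE:", round(mse(model, train_x, train_y), 3),
          "validation MSE:", round(mse(model, valid_x, valid_y), 3))
```

In this setup the memorizing model fits the training data perfectly, yet the validation records get poorer predictions from it than from the simple line, because it has also memorized the noise.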

Tuesday, January 26, 2010

Drag-and-drop data mining software for the classroom

The drag-and-drop (D&D) concept in data mining tools is very neat. You "drag" icons (aka "nodes") that do different operations, and "connect" them to create a data mining process. This is also called "graphical programming". What I especially like about it is that it keeps the big picture in your mind rather than getting blinded by analysis details. The end product is also much easier to present and document.

There has been quite a bonanza lately with a few of the major D&D data mining software tools. Clementine (by SPSS - now IBM) is now called "IBM SPSS Modeler". Insightful Miner (by Insightful - now TIBCO) is now TIBCO Spotfire Miner. SAS Enterprise Miner remains SAS EM. And STATISTICA Data Miner by StatSoft also remains in the same hands.

There's a good comparison of these four tools (and two more non-D&D, menu-driven tools: KXEN and XLMiner) on InformationManagement.com. The 2006 article by Nisbet compares performance, pricing, and more.

Let me look at the choice of a D&D package from the perspective of a professor teaching a data mining course in a business school. My own considerations are: (1) easy and fast to learn, (2) easy for my students to access, (3) cheap enough for our school to purchase, and (4) reasonably priced for students after they graduate. It's also nice to have good support (when things break down or when you just can't figure something out). And some instructors also like additional teaching materials.

I've had the longest experience with SAS EM, but it has been a struggle. At first we had individual student licenses, where each student had to download the software from a set of CDs that I had to circulate between them. The size of the software choked too many computers. So we moved to the server version (that allows students to use the software through our portal), but that has been excruciatingly slow. The server version is also quite expensive to the school. The potential solution was to move to using the "SAS on demand" product, where the software is accessed online and sits on the SAS servers. SAS offers this through the SAS on demand for Academics (SODA) program and it is faster. However, as I ranted in another post, SODA currently can only load SAS datasets. And finally, SAS EM is extremely expensive outside of academia. The likelihood that my students would have access to it in their post-graduation job was therefore low.

I recently discovered Spotfire Miner (by TIBCO) and played around with it. It is very easy to learn, runs fast, and happily accepts a wide range of data file types. Cost for industry is currently $349/month. For use in the classroom it is free to both instructors and students (as part of TIBCO's University Program)!

I can't say much about IBM SPSS Modeler (previously known as Clementine) or StatSoft's STATISTICA Data Miner, except that after looking thoroughly through their websites I couldn't find any mention of pricing for academia or for industry. And I usually don't like the "request a quote" route, which tends to leave my mailbox full of promotional materials forever (probably the result of a data mining algorithm used for direct marketing!). Is the academic version identical to the full-blown version? Is it a standalone installation or do you install it on a server?

For instructors who like extra materials: SAS offers a wealth of data mining teaching materials (you must contact them to receive the materials). StatSoft has a nice series of YouTube videos on different data mining topics and a brief PDF tutorial on data mining (they also have the awesome free Electronic Statistics Textbook which is a bit like an encyclopedia). I don't know of data mining teaching materials for the other packages (and couldn't find any on their websites).

It would be great to hear from other instructors and MBA students about their classroom (and post-graduation) experience with D&D software.

Wednesday, January 06, 2010

Creating map charts

With the growing amount of available geographical data, it is useful to be able to visualize one's data on top of a map. Visualizing numeric and/or categorical information on top of a map is called a map chart.

Two student teams in my Fall data mining class explored and displayed their data on map charts: one team compared economic, political, and well-being measures across different countries in the world. By linking a world map to their data, they could use color (hue and shading) to compare countries and geographical areas on those measures. Here's an example of two maps that they used. The top map uses shading to denote the average "well-being" score of a country (according to a 2004 Gallup poll), and the bottom map uses shading to denote the country's GDP. In both maps darker means higher.

Another team used a map to compare nursing homes in the US, in terms of quality of care scores. Their map below shows the average quality of nursing homes in each US state (darker means higher quality).

These two sets of maps were created using TIBCO Spotfire. Following many requests, here is an explanation of how to create a map chart in Spotfire. Once you have your ordinary data file open, there are 3 steps to add the map component:
  1. Obtain the threesome of "shapefiles" needed to plot the map of interest: .shp file, .dbf file, and .shx file (see Wikipedia for an explanation of each)
  2. Open the shapefile in Spotfire (Open>New Visualization> Map Chart, then upload the shp file in Map Chart Properties> Data tab> Map data table)
  3. Link the map table to your data table using the Map Chart Properties> Data tab > Related data table for coloring (you will need a unique identifier linking your data table with the map table)
The tricky part is obtaining shapefiles. One good source with free files is Blue Marble Geographics (thanks to Dan Curtis for this tip!). For US state and county data, shapefiles can be obtained from the US Census Bureau website (thanks to Ben Meadema for this one!). I'm still in search of more sources (for Europe and Asia, for instance).

I thank Smith MBA students Dan Curtis, Erica Eisenhart, John Geraghty and Ben Meadema for their contributions to this post.

Tuesday, December 22, 2009

My newest batch of graduating data mining MBAs


Congratulations to our Smith School's Fall 2009 "Data Mining for Business" students. I look forward to hearing about your future endeavors -- use data mining to do good!

Saturday, December 12, 2009

Stratified sampling: why and how?

In surveys and polls it is common to use stratified sampling. Stratified sampling is also used in data mining, when drawing a sample from a database (for the purpose of model building). This post follows an active discussion about stratification that we had in the "Scientific Data Collection" PhD class. Although stratified sampling is very useful in practice, the explanation of why to do it and how to do it usefully is not straightforward; the topic is only briefly touched upon in basic stats courses. The current Wikipedia entry further illustrates this knowledge gap.

What is stratifying? (that's the easy part)
Let's start by mentioning what an ordinary (not stratified) sample is: a "simple random sample" of size n means that we draw n records from the population at random. It's like drawing the numbers from a bag in Bingo.
Stratifying a population means dividing it into non-overlapping groups (called strata), where each unit in the population belongs to exactly one stratum. A straightforward example is stratifying the world's human inhabitants by gender. Of course various issues can arise such as duplications, but that's another story. A stratified (random) sample then means drawing a simple random sample from each stratum. In the gender example, we'd draw a simple random sample of females and a simple random sample of males. The combined samples would be our "stratified sample".
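
To make the two definitions concrete, here is a minimal sketch in Python. It is an illustration only: the population is synthetic, the function name `stratified_sample` is hypothetical, and drawing the same number of records from each stratum is an assumption (other allocations are possible).

```python
import random
from collections import defaultdict

def stratified_sample(records, stratum_of, n_per_stratum, rng):
    """Draw a simple random sample of n_per_stratum records from each stratum."""
    strata = defaultdict(list)
    for record in records:
        # Each record belongs to exactly one stratum.
        strata[stratum_of(record)].append(record)
    sample = []
    for label in sorted(strata):
        # random.sample draws without replacement, like numbers from a Bingo bag.
        sample.extend(rng.sample(strata[label], n_per_stratum))
    return sample

rng = random.Random(1)

# Synthetic population of 10,000 people with a gender label.
population = [{"id": i, "gender": rng.choice(["female", "male"])}
              for i in range(10_000)]

# Ordinary (not stratified) sample: 100 records drawn at random from everyone.
simple_random = rng.sample(population, 100)

# Stratified sample: 50 females and 50 males, each drawn at random.
stratified = stratified_sample(population, lambda r: r["gender"], 50, rng)

def count_females(sample):
    return sum(r["gender"] == "female" for r in sample)

print("females in simple random sample:", count_females(simple_random))
print("females in stratified sample:   ", count_females(stratified))
```

The simple random sample contains whatever gender mix the draw happens to produce, while the stratified sample contains exactly 50 records from each stratum by construction.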

Why stratify?
The main reason for stratifying is to improve the precision of whatever we're estimating. We could be interested in measuring the average weight of 1-year-old babies in a continent; the proportion of active voters in a country; the difference between the average salary of men and women in an industry; the change in the percent of overweight adults after opening the first McDonald's in a country (compared to the percent beforehand).

Because we are estimating a population quantity using only a sample (=a subset of the population), there is some inaccuracy in our sample estimate. The average weight in our sample is not identical to the average weight in the entire population. As we increase the sample size, a "good" estimate will become more precise (meaning that its variability from sample to sample will decrease). Stratifying can help improve the precision of a sample estimate without increasing the sample size. In other words, you can get the same level of precision by either drawing a larger simple random sample, or by drawing a stratified random sample of a smaller size. But this benefit will only happen if you stratify "smartly". Otherwise there will be no gain over a simple random sample.

How to stratify smartly?
This is the tricky part. The answer depends on what you are trying to measure.

If we are interested in an overall population measure (e.g., a population average, total or proportion), then the following rule will help you benefit from stratification: Create strata such that each stratum is homogeneous in terms of what's being measured.

Example: If we're measuring the average weight of 1-year-old babies in a continent, then stratifying by gender is a good idea: The boys' stratum will be more homogeneous in terms of weight compared to mixing boys and girls (and similarly the girls' stratum will be homogeneous in terms of weight). What are other good stratifying criteria that would create groups of homogeneous baby weights? How about country? the parents' weights?
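
The gain from homogeneous strata can be checked with a small simulation. The sketch below rests on assumptions: the population is synthetic (two equally sized groups whose means differ, with arbitrary made-up values rather than real baby weights), and the function names are hypothetical. It repeatedly draws samples of the same total size and compares how much the estimated population mean varies from sample to sample under three schemes: a simple random sample, a stratified sample with homogeneous strata, and a stratified sample whose strata were formed arbitrarily.

```python
import random
import statistics

rng = random.Random(2)

# Synthetic population: two homogeneous groups with different means (assumed values).
group_a = [rng.gauss(10.0, 1.0) for _ in range(5_000)]
group_b = [rng.gauss(14.0, 1.0) for _ in range(5_000)]
population = group_a + group_b

# A "non-smart" stratification: split the population arbitrarily into two halves.
shuffled = population[:]
rng.shuffle(shuffled)
arbitrary_1, arbitrary_2 = shuffled[:5_000], shuffled[5_000:]

def srs_mean(n):
    """Estimate the population mean from a simple random sample of size n."""
    return statistics.fmean(rng.sample(population, n))

def stratified_mean(stratum_1, stratum_2, n):
    """Estimate the population mean from n/2 records per stratum.

    The two strata are equally sized, so the plain average of the
    combined sample is the appropriate estimate here.
    """
    half = n // 2
    return statistics.fmean(rng.sample(stratum_1, half) + rng.sample(stratum_2, half))

def variance_over_repeats(draw, repeats=2_000):
    """Sample-to-sample variability of an estimate (smaller = more precise)."""
    return statistics.variance([draw() for _ in range(repeats)])

n = 40
var_srs = variance_over_repeats(lambda: srs_mean(n))
var_smart = variance_over_repeats(lambda: stratified_mean(group_a, group_b, n))
var_arbitrary = variance_over_repeats(lambda: stratified_mean(arbitrary_1, arbitrary_2, n))

print("simple random sample:", round(var_srs, 4))
print("homogeneous strata:  ", round(var_smart, 4))
print("arbitrary strata:    ", round(var_arbitrary, 4))
```

With the same total sample size, the estimate based on homogeneous strata varies much less across repeated samples, whereas the arbitrary strata behave about like a simple random sample.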

If we are interested in comparing measures of two populations, then the same idea applies, but requires more careful consideration: Create strata such that each stratum is homogeneous in terms of the difference between the two population measures.

Example: To compare the % of overweight adults in a country before and after opening the first McDonald's, stratification means finding a criterion that creates strata that are homogeneous in terms of the difference of before/after weight. One direction is to look for populations who would be affected differently by opening the McDonald's. For example, we could use income or some other economic status measure. If in the country of interest McDonald's is relatively cheap (e.g., the US), then the weight difference would be more pronounced in the poor stratum; in contrast, if in the country of interest McDonald's is relatively expensive (e.g., in Asia), then the weight difference would be less pronounced in the poor stratum and more pronounced in the wealthy stratum. In either country, using economic status as a stratifying criterion is likely to create strata that are homogeneous in terms of the difference of interest.

In data mining, stratified sampling is used when a certain class is rare in the population and we want to make sure that we have sufficient representation of that class in our sample. This is called over-sampling. A classic example is in direct mail marketing, where the rate of responders is usually very low (under 1%). To build a model that can discriminate responders from non-responders usually requires a minimum sample of each class. In predictive tasks (such as predicting the probability of a new person responding to the offer) the interest is not directly in estimating the population parameters. Yet, the precision of the estimated coefficients (i.e., their variance) influences the predictive accuracy of the model. Hence, over-sampling can improve predictive accuracy by again lowering the sampling variance. This conclusion is my own, and I have not seen mention of this last point anywhere. Comments are most welcome!
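
Over-sampling of a rare class can be sketched as follows. This is an illustration under assumptions: the mailing list is synthetic with exactly a 1% response rate, the function name `oversample` is hypothetical, and taking an equal number of records from each class is just one possible choice.

```python
import random

rng = random.Random(3)

# Synthetic mailing list: 1,000 responders among 100,000 people (1%, assumed).
population = [{"id": i, "responder": i < 1_000} for i in range(100_000)]

def oversample(records, n_per_class, rng):
    """Stratify by class and draw n_per_class records at random from each class."""
    responders = [r for r in records if r["responder"]]
    non_responders = [r for r in records if not r["responder"]]
    return rng.sample(responders, n_per_class) + rng.sample(non_responders, n_per_class)

# A simple random sample of 1,000 records contains only a handful of responders.
simple_random = rng.sample(population, 1_000)

# An over-sampled set of the same size contains 500 responders and 500 non-responders.
balanced = oversample(population, 500, rng)

srs_responders = sum(r["responder"] for r in simple_random)
balanced_responders = sum(r["responder"] for r in balanced)

print("responders in simple random sample:", srs_responders)
print("responders in over-sampled set:    ", balanced_responders)
```

Note that the over-sampled set has a 50% response rate by construction, which no longer reflects the 1% rate in the population.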

Friday, November 06, 2009

The value of p-values: Science magazine asks

My students know how I cringe when I am forced to teach them p-values. I have always felt that their meaning is hard to grasp, and hence they are mostly abused when used by non-statisticians. This is clearly happening in research using large datasets, where p-values are practically useless for inferring practical importance of effects (check out our latest paper on the subject, which looks at large-dataset research in Information Systems).

So, when one of the PhD students taking my "Scientific Data Collection" course stumbled upon this recent Science Magazine article "Mission Improbable: A Concise and Precise Definition of P-Value" he couldn't resist emailing it to me. The article showcases the abuse of p-values in medical research due to their elusive meaning. This is not even with large samples! Researchers incorrectly interpret the meaning of a p-value to be the probability of an effect rather than its statistical significance. The result of such confusion can clearly be devastating when the issue at stake is the effectiveness of a new drug or vaccine.

There are obviously better ways for assessing statistical significance, which are better aligned with practical significance and are also less ambiguous than p-values. One is confidence intervals. You get an estimate of your effect plus/minus some margin. You can then evaluate what the interval means practically. Another approach (good to try both) is to test predictive accuracy of your model, to see whether the prediction error is at a reasonable level -- this is achieved by applying your model to new data, and evaluating how well it fits those new data.
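
A small simulation shows why a confidence interval can be easier to interpret than a p-value when the sample is large. It is a sketch under assumptions: the two groups are synthetic, the true difference between them (0.05 standard deviations) is made up for illustration, and both the interval and the p-value use the large-sample normal approximation.

```python
import math
import random
from statistics import NormalDist

rng = random.Random(4)
n = 50_000  # records per group (a "large dataset")

# Synthetic outcomes: the treated group is shifted by a tiny amount (assumed).
control = [rng.gauss(0.00, 1.0) for _ in range(n)]
treated = [rng.gauss(0.05, 1.0) for _ in range(n)]

def mean_and_variance(xs):
    """Sample mean and (n - 1)-denominator sample variance."""
    m = sum(xs) / len(xs)
    v = sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    return m, v

mean_c, var_c = mean_and_variance(control)
mean_t, var_t = mean_and_variance(treated)

# Estimated effect and its standard error.
diff = mean_t - mean_c
std_err = math.sqrt(var_t / n + var_c / n)

# 95% confidence interval: estimate plus/minus a margin.
z_crit = NormalDist().inv_cdf(0.975)
ci_low, ci_high = diff - z_crit * std_err, diff + z_crit * std_err

# Two-sided p-value for the null hypothesis of no difference.
z = diff / std_err
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print("estimated difference:", round(diff, 4))
print("95% confidence interval:", (round(ci_low, 4), round(ci_high, 4)))
print("p-value:", p_value)
```

The p-value comes out far below 0.05, which says nothing about how large the effect is. The interval, in contrast, shows directly that the difference is only a few hundredths of a standard deviation, and one can then judge whether an effect of that size matters in practice.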

Shockingly enough, people seem to really want to use p-values, even if they don't understand them. I recently was involved in designing materials for a basic course on statistics for engineers and managers in a big company. We created an innovative and beautiful set of slides, with real examples, straightforward explanations, and practical advice. The 200+ slides made no mention of p-values, but rather focused on measuring effects, understanding sampling variability, standard errors and confidence intervals, seeing the value of residual analysis in linear regression, and learning how to perform and evaluate prediction. Yet, we were requested by the company to replace some of this material ("not sure if we will need residual analysis, sampling error etc. our target audience may not use it") with material on p-values and on the 0.05 threshold ("It will be sufficient to know interpreting the p-value and R-sq to interpret the results"). Sigh.
It's hard to change a culture with such a long history.