Stock performance and CEO house size study

The BusinessWeek article “The CEO Mega-Mansion Factor” (April 2, 2007) definitely caught my attention: two finance professors (Liu and Yermack) collected data on house sizes of CEOs of the S&P 500 companies in 2004. Their theory is: “If home purchases represent a signal of commitment by the CEO, subsequent stock performance of the company should at least remain unchanged and possibly improve. Conversely, if home purchases represent a signal of entrenchment, we would expect stock performance to decline after the time of purchase.” The article summarizes the results: “[they] found that 12% of [CEOs] lived in homes of at …

Google purchases data visualization tool

Once again, some hot news from my ex-student Adi Gadwale: Google recently purchased a data visualization tool from Professor Hans Rosling at Stockholm’s Karolinska Institute (read the story). Adi also sent me the link to Gapminder, the tool that Google has put out. For those of us who’ve become addicts of the interactive visualization tool Spotfire, this looks pretty familiar!

Multiple Testing

My colleague Ralph Russo often comes up with memorable examples for teaching complicated concepts. He recently sent me an Economist article called “Signs of the Times” that shows the absurd results that can be obtained if multiple testing is not taken into account. Multiple testing arises when the same data are used simultaneously for testing many hypotheses. The problem is a huge inflation in the type I error (i.e., rejecting the null hypothesis in error). Even if each single hypothesis is tested at a low significance level (e.g., the infamous 5% level), the aggregate type I error becomes huge …
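To make the inflation concrete, here is a minimal Python sketch. It assumes the tests are independent and that every null hypothesis is true, which is a simplification that the excerpt above does not state; the function name `familywise_error_rate` is my own. Under those assumptions, the chance of at least one false rejection among m tests at level alpha is 1 - (1 - alpha)^m.

```python
def familywise_error_rate(alpha: float, m: int) -> float:
    """Probability of at least one type I error among m tests.

    Assumes (for illustration only) that the m tests are independent
    and that all m null hypotheses are true.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    if m < 1:
        raise ValueError("m must be at least 1")
    # Each test avoids a false rejection with probability (1 - alpha);
    # all m avoid it with probability (1 - alpha) ** m.
    return 1.0 - (1.0 - alpha) ** m


if __name__ == "__main__":
    # Show how the aggregate error grows with the number of tests.
    for m in (1, 5, 20, 100):
        print(m, round(familywise_error_rate(0.05, m), 3))
```

With a single test the aggregate error equals the 5% level; with 20 tests it is already well above one half, which is the "huge" inflation the post refers to.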

Source for data

Adi Gadwale, a student in my 2004 MBA Data Mining class, still remembers my fetish with business data and data visualization. He just sent me a link to an IBM Research website called Many Eyes, which includes user-submitted datasets as well as Java-applet visualizations. The datasets include quite a few “junk” datasets, lots with no description. But there are a few interesting ones: FDIC is a “scrubbed list of FDIC institutions removing inactive entities and stripping all columns apart from Assets, ROE, ROA, Offices (Branches), and State”. It includes 8711 observations. Another is Absorption Coefficients of Common Materials – I …

Accuracy measures

There is a host of metrics for evaluating predictive performance. They are all based on aggregating the forecast errors in some form. The two most famous metrics are RMSE (Root-mean-squared-error) and MAPE (Mean-Absolute-Percentage-Error). In an earlier posting (Feb-23-2006) I disclosed a secret deciphering method for computing these metrics. Although these two have been the most popular in software, competitions, and published papers, they have their shortcomings. One serious flaw of the MAPE is that zero counts contribute a value of infinity to the MAPE (because of the division by zero). One solution is to leave the zero counts out of …
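As a rough illustration of the two metrics and of the zero-count problem, here is a small Python sketch. The function names and the `skip_zeros` option are my own choices for this example; dropping zero actual values is just one possible way to handle them, in the spirit of the solution mentioned above.

```python
import math


def rmse(actual, forecast):
    """Root-mean-squared error of the forecast errors (actual - forecast)."""
    errors = [a - f for a, f in zip(actual, forecast)]
    if not errors:
        raise ValueError("need at least one observation")
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def mape(actual, forecast, skip_zeros=False):
    """Mean absolute percentage error, in percent.

    If any actual value is zero and skip_zeros is False, the result is
    infinity, because that term divides by zero. With skip_zeros=True
    (an illustrative option, not a standard), zero actuals are dropped.
    """
    pairs = list(zip(actual, forecast))
    if skip_zeros:
        pairs = [(a, f) for a, f in pairs if a != 0]
    if not pairs:
        raise ValueError("no observations left to average")
    if any(a == 0 for a, _ in pairs):
        return math.inf
    return 100.0 * sum(abs((a - f) / a) for a, f in pairs) / len(pairs)


if __name__ == "__main__":
    actual = [10, 0, 20]      # made-up counts, including a zero
    forecast = [12, 1, 18]    # made-up forecasts
    print(rmse(actual, forecast))
    print(mape(actual, forecast))
    print(mape(actual, forecast, skip_zeros=True))
```

Note that dropping zeros changes which observations the MAPE averages over, so it is no longer comparable to an RMSE computed on the full series.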

Lots of real time series data!

I love data-mining or statistics competitions – they always provide great real data! However, the big difference between a gold mine and “just some data” is whether the data description and context are complete. This reflects, in my opinion, the difference between “data mining for the purpose of data mining” vs. “data mining for business analytics” (or any other field of interest, such as engineering or biology). Last year, the BICUP2006 posted an interesting dataset on bus ridership in Santiago de Chile. Although there was a reasonable description of the data (number of passengers at a bus station at …