Now that the emotional storm following the American Statistical Association’s statement on p-values is slowing down (is it? was there even a storm outside of the statistics area?), let’s think about a practical issue. One that greatly influences data analysis in most fields: statistical software. Statistical software influences which methods are used and how they are reported. Software companies thus affect entire disciplines and how they progress and communicate.
Star notation for p-value thresholds in statistical software
No matter whether your field uses SAS, SPSS (now IBM), STATA, or another statistical software package, you’re likely to have seen the star … Continue reading Statistical software should remove *** notation for statistical significance
The parallel coordinate plot is useful for visualizing multivariate data in a disaggregated way, where we have multiple numerical measurements for each record. A scatter plot displays two measurements for each record by using the two axes. A parallel coordinate plot can display many measurements for each record by using many (parallel) axes, one for each measurement. While not as popular as other charts, it sometimes turns out to be useful, so it’s good to have it in the visualization toolkit. Software such as TIBCO Spotfire and XLMiner includes the parallel coordinate plot. There’s even a free Excel add-on. But … Continue reading Parallel coordinate plot in Tableau: a workaround
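To make the idea of "one axis per measurement" concrete, here is a minimal Python sketch (not tied to Tableau, Spotfire, or XLMiner). It rescales each measurement to its own 0-to-1 axis and turns every record into a polyline of (axis position, height) points, which is the geometry a parallel coordinate plot draws. The function name `parallel_coordinates`, the field names, and the three records are all made up for illustration.

```python
def parallel_coordinates(records, measurements):
    """Return one polyline per record: a list of (axis_index, height) points.

    Each measurement gets its own vertical axis, rescaled so that the
    smallest observed value sits at 0 and the largest at 1.
    """
    lows = {m: min(r[m] for r in records) for m in measurements}
    highs = {m: max(r[m] for r in records) for m in measurements}

    lines = []
    for record in records:
        line = []
        for axis_index, m in enumerate(measurements):
            span = highs[m] - lows[m]
            # A constant measurement has no spread; place it mid-axis.
            height = (record[m] - lows[m]) / span if span else 0.5
            line.append((axis_index, height))
        lines.append(line)
    return lines


# Hypothetical records with three numerical measurements each.
records = [
    {"price": 10.0, "weight": 2.0, "rating": 4.0},
    {"price": 20.0, "weight": 1.0, "rating": 5.0},
    {"price": 15.0, "weight": 3.0, "rating": 3.0},
]
lines = parallel_coordinates(records, ["price", "weight", "rating"])
for line in lines:
    print(line)
```

Drawing each returned polyline across equally spaced vertical axes gives the chart; any plotting tool that can draw line segments would do for that step.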
In business schools it is common to teach statistics courses using Microsoft Excel, due to its wide accessibility and the familiarity of business students with the software. There is a large debate regarding this practice, but at this point the reality is clear: the figure that I am familiar with is that about 50% of basic stat courses in b-schools use Excel and 50% use statistical software such as Minitab or JMP. Another trend is the move from offline software to “cloud computing”: software such as www.statcrunch.com offers basic stat functions in an online, collaborative, social-networky style. Following the popularity of … Continue reading Google Spreadsheets for teaching probability?
I just learned of the new Prediction API by Google. In brief, you upload a training set with up to 1 million records and let Google’s engine build an algorithm trained on the data. Then, upload a new dataset for prediction, and Google will apply the learned algorithm to score those data. On the user’s side, this is a total black box, since you have no idea which algorithms are considered and which one is chosen (probably an ensemble). The predictions can therefore be used only for their utility (accurate predictions). For researchers, this is a great tool for getting a predictive accuracy … Continue reading Google’s new prediction API
I am following up on two earlier posts regarding using SAS On Demand for Academics. The version of EM has been upgraded to 6.1, which means that I am now able to upload and reach non-SAS files on the SAS Server – hurray! The process is quite cumbersome, and I do thank my SAS programming memory from a decade ago. Here’s a description for those instructors who want to check it out (it took me quite a while to piece together all the different parts and figure out the right code): Find the directory path for your course on the SAS … Continue reading SAS On Demand Take 3: Success!
The drag-and-drop (D&D) concept in data mining tools is very neat. You “drag” icons (aka “nodes”) that do different operations, and “connect” them to create a data mining process. This is also called “graphical programming”. What I especially like about it is that it keeps the big picture in your mind, so that you don’t get blinded by analysis details. The end product is also much easier to present and document. There has been quite a bonanza lately with a few of the major D&D data mining software tools. Clementine (by SPSS – now IBM) is now called “IBM SPSS Modeler”. Insightful … Continue reading Drag-and-drop data mining software for the classroom
I’ve recently had interesting discussions with colleagues in Information Systems regarding testing directional hypotheses. Following their request, I’m posting about this apparently elusive issue. In information systems research, the most common type of hypothesis is directional, i.e. the parameter of interest is hypothesized to go in a certain direction. An example would be testing the hypothesis that teenagers are more likely than older folks to use Facebook. Another example is the hypothesis that higher opening bids on eBay lead to higher final prices. In the Facebook example, the researcher would test the hypothesis by gathering data on Facebook usage by … Continue reading Testing directional hypotheses: p-values can bite
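As a rough illustration of what a directional test looks like in practice, here is a minimal Python sketch of a pooled two-proportion z-test in the spirit of the Facebook example. It is an assumption-laden sketch and not the analysis from the post: the counts are invented, the function names are hypothetical, and the normal approximation is assumed to be adequate. It shows how the one-sided p-value for the hypothesized direction relates to the two-sided p-value that software typically reports.

```python
import math


def two_proportion_z(successes_a, n_a, successes_b, n_b):
    """Pooled z statistic for H0: p_a = p_b (positive when p_a > p_b)."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / std_err


def upper_tail_p(z):
    """One-sided p-value for the directional alternative p_a > p_b."""
    return 0.5 * math.erfc(z / math.sqrt(2))


def two_sided_p(z):
    """Two-sided p-value for the non-directional alternative p_a != p_b."""
    return math.erfc(abs(z) / math.sqrt(2))


# Invented counts: 60 of 100 teenagers vs. 45 of 100 older users on Facebook.
z = two_proportion_z(60, 100, 45, 100)
print("z statistic:", round(z, 3))
print("one-sided p (teens > older):", round(upper_tail_p(z), 4))
print("two-sided p:", round(two_sided_p(z), 4))
```

When the data fall in the hypothesized direction, the one-sided p-value is half the two-sided one. When they fall in the opposite direction, the one-sided p-value exceeds 0.5, so halving the two-sided value from software output would be wrong.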
Following up on my previous posting about using SAS Enterprise Miner via the On Demand platform: From continued communication with experts at SAS, it turns out that with the EM version 5.3, which is the one available through On Demand, there is no way to work with (or even access) non-SAS files. Their suggested solution is to use some other SAS product like SAS BASE, or even SAS JMP (which is available through the On Demand platform) in order to convert your CSV files to SAS data files… From both a pedagogical and practical point of view, I am reluctant to … Continue reading SAS On Demand: Enterprise Miner — Update
I am in the process of trying out SAS Enterprise Miner via the (relatively new) SAS On Demand for Academics. In our MBA data mining course at Smith, we introduce SAS EM. In the early days, we’d get individual student licenses and have each student install the software on their computer. However, the software took too much space and it was also very awkward to circulate a packet of CDs between multiple students. We then moved to the Server option, where SAS EM is available on the Smith School portal. Although it solved the individual installation and storage issues, the … Continue reading SAS On Demand: Enterprise Miner
Scatterplots are extremely popular and useful graphical displays for examining the relationship between two numeric variables. They get even better when we add the use of color/hue and shape to include information on a third, categorical variable (or we can use size to include information on an additional numerical variable, to produce a “bubble chart”). For example, say we want to examine the relationship between the happiness of a nation and the percent of the population that live in poverty conditions — using 2004 survey data from the World Database of Happiness. We can create a scatterplot with “Happiness” on … Continue reading Creating color-coded scatterplots in Excel: a nightmare