While huge datasets have become ubiquitous in fields such as genomics, large datasets are now also beginning to infiltrate research in the social sciences. Data from eCommerce sites, online dating sites, etc. are now collected as part of research in information systems, marketing and related fields. We can now find social science research papers with hundreds of thousands of observations or more. A common type of research question in such studies concerns the relationship between two variables. For example, how does the final price of an online auction relate to the seller’s feedback rating? A classic exploratory tool for examining such … Continue reading Scatter plots for large samples
Being in Bhutan this year, I have asked the American Statistical Association (ASA) and INFORMS to mail the magazines that come with my membership to Bhutan. Although I can access the magazines online, I greatly enjoy receiving the issues by mail (even if a month late) and leafing through them leisurely. Not to mention the ability to share them with local colleagues who are seeing these magazines for the first time! Now to the data-analytic reason for my post: The main article in the August 2010 issue of AMSTAT News (the ASA’s magazine) on Fellow Award: Revisited (Again) presented an “update to … Continue reading ASA’s magazine: Excel’s default charts
Scatterplots are extremely popular and useful graphical displays for examining the relationship between two numeric variables. They get even better when we use color/hue and shape to include information on a third, categorical variable (or use size to include information on an additional numerical variable, producing a “bubble chart”). For example, say we want to examine the relationship between the happiness of a nation and the percentage of the population living in poverty, using 2004 survey data from the World Database of Happiness. We can create a scatterplot with “Happiness” on … Continue reading Creating color-coded scatterplots in Excel: a nightmare
The European Network for Business and Industrial Statistics (ENBIS) has released the 2009 ENBIS Challenge. The challenge this time is to use an exploratory data analysis (EDA) tool to answer a bunch of questions regarding sales of laptop computers in London. The data on nearly 200,000 transactions include 3 files: sales data (for each computer sold, with time stamps and zipcode locations of customer and store), computer configuration information, and geographic information linking zipcodes to GIS coordinates. Participants are challenged to answer a set of 11 questions using EDA. The challenge is sponsored by JMP (by SAS), who are obviously … Continue reading Data Exploration Celebration: The ENBIS 2009 Challenge
Histograms are very useful charts for displaying the distribution of a numerical measurement. The idea is to bucket the numerical measurement into intervals, and then to display the frequency (or percentage) of records in each interval. Two ways to generate a histogram in Excel are: Create a pivot table, with the measurement of interest in the Column area, and Count of that measurement (or any measurement) in the Data area. Then, right-click the Column area and choose “Group and Show Detail > Group” to create the intervals. Now simply click the chart wizard to create the matching chart. You will still … Continue reading Histograms in Excel
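For readers working outside Excel, the bucketing idea behind a histogram can be sketched in a few lines of Python. This is only an illustration, not the method from the post: the function name `bin_counts`, the fixed bin width, and the sample values are all made up for the example.

```python
import math
from collections import Counter


def bin_counts(values, bin_width):
    """Count how many values fall into each interval [k*w, (k+1)*w).

    Hypothetical helper for illustration; returns a list of
    (left_edge, right_edge, count) tuples, including empty intervals
    between the smallest and largest occupied ones.
    """
    # Map every value to the index of the interval it falls into.
    counts = Counter(math.floor(v / bin_width) for v in values)
    first, last = min(counts), max(counts)
    return [
        (k * bin_width, (k + 1) * bin_width, counts.get(k, 0))
        for k in range(first, last + 1)
    ]


# Made-up measurements, used only to show the output format.
sample = [3, 7, 12, 15, 18, 21, 22, 29, 41]

for left, right, n in bin_counts(sample, 10):
    # A crude text histogram: one '#' per record in the interval.
    print(f"[{left:>3}, {right:>3}) {'#' * n} ({n})")
```

Keeping the empty interval (30 to 40 here) matters: dropping it would hide the gap in the data, which is exactly the kind of feature a histogram is meant to reveal.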
Herb Edelstein from Two Crows consulting introduced me to this neat example showing how graphs are much more revealing than summary statistics. This is an age-old example by Anscombe (1973). I will show a slightly updated version of Anscombe’s example, by Basset et al. (1986): We have four datasets, each containing 11 pairs of X and Y measurements. All four datasets have the same X variable, and only differ on the Y values. Here are the summary statistics for each of the four Y variables (A, B, C, D): A B C D Average 20.95 20.95 20.95 20.95 Std 1.495794 1.495794 … Continue reading Summaries or graphs?
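The same point can be checked with a short Python sketch. Note that it uses the Y values from Anscombe’s original 1973 quartet, not the Basset et al. version summarized in the post, so the averages and standard deviations differ from the figures quoted above. The labels A to D are borrowed from the post purely for convenience.

```python
from statistics import mean, stdev

# Y values of Anscombe's original quartet (1973). These are NOT the
# Basset et al. (1986) data described in the post.
quartet = {
    "A": [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68],
    "B": [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74],
    "C": [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73],
    "D": [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89],
}

# Average and sample standard deviation of each Y variable.
summary = {name: (mean(y), stdev(y)) for name, y in quartet.items()}

for name, (avg, sd) in summary.items():
    print(f"{name}: average = {avg:.2f}, std = {sd:.2f}")
```

The four rows of output agree to two decimals, even though a glance at the raw lists (for instance the lone 12.74 in C or the 12.50 in D) already hints that the datasets look quite different once plotted.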