While huge datasets have become ubiquitous in fields such as genomics, large datasets are now also beginning to infiltrate research in the social sciences. Data from eCommerce sites, online dating sites, etc. are now collected as part of research in information systems, marketing and related fields. We can now find social science research papers with hundreds of thousands of observations and more. A common type of research question in such studies is about the relationship between two variables. For example, how does the final price of an online auction relate to the seller’s feedback rating? A classic exploratory tool for examining such … Continue reading Scatter plots for large samples
Scatterplots are extremely popular and useful graphical displays for examining the relationship between two numeric variables. They get even better when we add the use of color/hue and shape to include information on a third, categorical variable (or we can use size to include information on an additional numerical variable, to produce a “bubble chart”). For example, say we want to examine the relationship between the happiness of a nation and the percent of the population that lives in poverty conditions, using 2004 survey data from the World Database of Happiness. We can create a scatterplot with “Happiness” on … Continue reading Creating color-coded scatterplots in Excel: a nightmare
Herb Edelstein from Two Crows consulting introduced me to this neat example showing how graphs are much more revealing than summary statistics. This is an age-old example by Anscombe (1973). I will show a slightly updated version of Anscombe’s example, by Basset et al. (1986): We have four datasets, each containing 11 pairs of X and Y measurements. All four datasets have the same X variable, and only differ on the Y values. Here are the summary statistics for each of the four Y variables (A, B, C, D): A B C D Average 20.95 20.95 20.95 20.95 Std 1.495794 1.495794 … Continue reading Summaries or graphs?