Data Ethics Regulation: Two key updates in 2018

This year, two important new regulations will affect research with human subjects: the EU’s General Data Protection Regulation (GDPR), which takes effect in May 2018, and the USA’s updated Common Rule, called the Final Rule, which is in effect from Jan 2018. Both changes relate to protecting individuals’ private information and will affect researchers using behavioral data in terms of data collection, access, use, applications for ethics committee (IRB) approvals/exemptions, collaborations within the same country/region and beyond, and collaborations with industry. Both GDPR and the Final Rule try to modernize what today constitutes “private data” and data subjects’ rights and balance … Continue reading Data Ethics Regulation: Two key updates in 2018

Election polls: description vs. prediction

My papers To Explain or To Predict and Predictive Analytics in Information Systems Research contrast the process and uses of predictive modeling and causal-explanatory modeling. I briefly mentioned there a third type of modeling: descriptive. However, I haven’t expanded on how descriptive modeling differs from the other two types (causal explanation and prediction). Although descriptive and predictive modeling both rely on correlations, whereas explanatory modeling relies on causality, the former two are in fact different. Descriptive modeling aims to give a parsimonious statistical representation of a distribution or relationship, whereas predictive modeling aims at generating values for new/future observations. … Continue reading Election polls: description vs. prediction

Data mining algorithms: how many dummies?

There are lots of posts on “k-NN for Dummies”. This one is about “Dummies for k-NN”. Categorical predictor variables are very common. Those who’ve taken a Statistics course covering linear (or logistic) regression know that including a categorical predictor in a regression model requires the following steps: (1) convert the categorical variable that has m categories into m binary dummy variables; (2) include only m-1 of the dummy variables as predictors in the regression model (the dropped category is called the reference category). For example, if we have X={red, yellow, green}, in step 1 we create three dummies: D_red = … Continue reading Data mining algorithms: how many dummies?
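The two regression steps described in this excerpt can be sketched in a few lines of plain Python. This is only an illustrative sketch, not code from the post: the helper names `make_dummies` and `drop_reference`, the four-row sample of colors, and the choice of green as the reference category are all assumptions made for the example.

```python
def make_dummies(values, categories):
    """Step 1: build one binary (0/1) column per category, so m categories give m dummies."""
    return {f"D_{c}": [1 if v == c else 0 for v in values] for c in categories}


def drop_reference(dummies, reference):
    """Step 2: keep only m-1 dummies by dropping the reference category's column."""
    return {name: col for name, col in dummies.items() if name != f"D_{reference}"}


categories = ["red", "yellow", "green"]        # X = {red, yellow, green}, so m = 3
colors = ["red", "green", "yellow", "green"]   # made-up sample, for illustration only

all_dummies = make_dummies(colors, categories)              # D_red, D_yellow, D_green
regression_dummies = drop_reference(all_dummies, "green")   # reference choice is arbitrary

for name, column in regression_dummies.items():
    print(name, column)
```

In every row exactly one of the m dummies equals 1, so any one column can be recovered from the others; that redundancy is why a regression model uses only m-1 of them.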

What’s in a name? “Data” in Mandarin Chinese

The term “data”, now popularly used in many languages, is not as innocent as it seems. The biggest controversy that I’ve been aware of is whether the English term “data” is singular or plural. The tone of an entire article would be different based on the author’s decision. In Hebrew, the word is plural (Netunim, with the final “im” signifying plural), so no question arises. Today I discovered another “data” duality, this time in Mandarin Chinese. In Taiwan, the term used is 資料 (Zīliào), while in Mainland China it is 數據 (Shùjù). Which one to use? What is the … Continue reading What’s in a name? “Data” in Mandarin Chinese

An Appeal to Companies: Leave the Data Behind @kaggle @crowdanalytix_q

A while ago I wrote about the wonderful new age of real-data-made-available-for-academic-use through the growing number of data mining contests on platforms such as Kaggle and CrowdANALYTIX. Such data provide excellent examples for courses on data mining that help train the next generation of data scientists, business analysts, and other data-savvy graduates. A couple of years later, I am discovering the painful truth that many of these datasets are no longer available. The most likely reason is that the companies that shared the data have pulled them out. This unexpected twist is extremely harmful to both academia and industry: instructors … Continue reading An Appeal to Companies: Leave the Data Behind @kaggle @crowdanalytix_q

Data liberation via visualization

“Data democratization” movements try to make data, and especially government-held data, publicly available and accessible. A growing number of technological efforts are devoted to this goal, and especially to the accessibility part. One such effort is by data visualization companies. A recent trend is to offer a free version (or at least free for some period) that is based on sharing your visualization and/or data on the Web. The “and/or” here is important, because in some cases you cannot share your data, but you would like to share the visualizations with the world. This is what I call “data liberation via … Continue reading Data liberation via visualization

Got Data?!

The American Statistical Association’s store used to sell cool T-shirts with the old-time beggar-statistician question “Got Data?” Today it is much easier to find data, thanks to the Internet. Dozens of student teams taking my data mining course have been able to find data from various sources on the Internet for their team projects. Yet, I often receive queries from colleagues in search of data for their students’ projects. This is especially true for short courses, where students don’t have sufficient time to search and gather data (which is highly educational in itself!). One solution that I often offer is … Continue reading Got Data?!

SAS On Demand: Enterprise Miner — Update

Following up on my previous posting about using SAS Enterprise Miner via the On Demand platform: From continued communication with experts at SAS, it turns out that with EM version 5.3, which is the one available through On Demand, there is no way to work with (or even access) non-SAS files. Their suggested solution is to use some other SAS product like SAS BASE, or even SAS JMP (which is available through the On Demand platform) in order to convert your CSV files to SAS data files… From both a pedagogical and practical point of view, I am reluctant to … Continue reading SAS On Demand: Enterprise Miner — Update

SAS On Demand: Enterprise Miner

I am in the process of trying out SAS Enterprise Miner via the (relatively new) SAS On Demand for Academics. In our MBA data mining course at Smith, we introduce SAS EM. In the early days, we’d get individual student licenses and have each student install the software on their computer. However, the software took too much space and it was also very awkward to circulate a packet of CDs between multiple students. We then moved to the Server option, where SAS EM is available on the Smith School portal. Although it solved the individual installation and storage issues, the … Continue reading SAS On Demand: Enterprise Miner

Data Exploration Celebration: The ENBIS 2009 Challenge

The European Network for Business and Industrial Statistics (ENBIS) has released the 2009 ENBIS Challenge. The challenge this time is to use an exploratory data analysis (EDA) tool to answer a bunch of questions regarding sales of laptop computers in London. The data, covering nearly 200,000 transactions, come in 3 files: sales data (for each computer sold, with time stamps and zipcode locations of customer and store), computer configuration information, and geographic information linking zipcodes to GIS coordinates. Participants are challenged to answer a set of 11 questions using EDA. The challenge is sponsored by JMP (by SAS), who are obviously … Continue reading Data Exploration Celebration: The ENBIS 2009 Challenge