BusinessWeek touched upon a sensitive issue in the article “If You’re Cheating on Your Taxes”. It’s about federal and state agencies using data mining to find “the bad guys”.
Although “data mining” is the term used in many of these stories, a more careful look reveals that hardly any advanced statistical/DM methods are involved. The real issue is the linkage/matching of different data sources. At the Statistical Challenges & Opportunities in eCommerce symposium last year, Stephen Fienberg, a professor of statistics at CMU and an expert on disclosure limitation, showed a semi-futuristic movie (from the American Civil Liberties Union website) about a pizza parlor using an array of linked datasets to “customize” a delivery call. He also wrote a paper on privacy and data mining [1] that will come out soon in a special issue of the journal Statistical Science on the same topic (OK, I’ll disclose that I co-edited this with Wolfgang Jank).
Another interesting document can be found on the American Statistical Association’s website: FAQ Regarding the Privacy Implications of Data Mining.
The bottom line is that statistical/data mining methods and tools are not the evil here. In fact, in some cases statistical methods allow the exact opposite: disclosing data in a way that still allows inference but conceals any information that might breach privacy. This area is called Disclosure Limitation and is studied mainly by statisticians, operations researchers, and computer scientists.
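To make the idea of "allowing inference while concealing individuals" a bit more concrete, here is a minimal sketch of one simple disclosure limitation technique, microaggregation: each value is replaced by the average of a small group of similar values. This is only an illustration, not the method used in any of the sources mentioned above; the function name `microaggregate`, the group size `k`, and the income figures are all made up for the example.

```python
from statistics import mean


def microaggregate(values, k=3):
    """Replace each value with the mean of its group of at least k similar values.

    Hypothetical helper for illustration: values are sorted, cut into
    consecutive groups of size k, and a short final group is merged into
    the previous one so that no group has fewer than k members.
    """
    if k < 1:
        raise ValueError("k must be at least 1")

    # Indices of the records, ordered by value, so that groups hold similar values.
    order = sorted(range(len(values)), key=lambda i: values[i])
    groups = [order[i:i + k] for i in range(0, len(order), k)]

    # Merge a too-small tail group into the group before it.
    if len(groups) > 1 and len(groups[-1]) < k:
        tail = groups.pop()
        groups[-1].extend(tail)

    masked = [0.0] * len(values)
    for group in groups:
        group_mean = mean(values[i] for i in group)
        for i in group:
            masked[i] = group_mean
    return masked


# Made-up incomes; the 120,000 record would stand out if released as-is.
incomes = [31_000, 29_500, 120_000, 45_000, 47_500, 52_000, 33_000]
released = microaggregate(incomes, k=3)

print(released)
print(mean(incomes), mean(released))
```

In the released data every value is shared by at least `k` records, so the unusually high income can no longer be pinned on one person, yet the overall average is unchanged (up to rounding). Other summaries, such as the variance, are distorted, which is exactly the kind of trade-off that disclosure limitation research studies.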
[1] S. E. Fienberg (2006), “Privacy and Confidentiality in an E-Commerce World: Data Mining, Data Warehousing, Matching, and Disclosure Limitation,” Statistical Science, special issue on “Statistical Challenges and Opportunities in eCommerce”, forthcoming.