Showing posts from October, 2014

The IRS and Data Science

Today, the NYT published an article about the IRS seizing bank accounts on suspicion alone. In short, the IRS intends to battle drug cartels by flagging accounts with suspicious activity. The quote from the officer gives us some insight into their methodology: based on my training and experience, the pattern “is consistent with structuring.” In other words, drug cartels have a higher propensity to make deposits below $10K (the ceiling above which the bank must report the deposit by law), so the IRS flagged all of the accounts fitting this criterion. However, the stats reported by the Institute for Justice are mediocre: only one in five seizures was prosecuted as a criminal structuring case. If innocent, the victims are left to prove it, with all the red tape that involves. The problem: this method simply does not work well, for the following reasons: 1- A False Positive (seizing an account that did not belong to a drug organization) is very costly, from a reputation persp
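To make the "one in five" figure concrete, it can be read as the precision of the flagging rule. The sketch below is only an illustration: the count of 1,000 flagged accounts is a hypothetical number (it does not come from the article), and treating "prosecuted" as "truly structuring" is a simplifying assumption.

```python
def precision(confirmed, flagged):
    """Share of flagged accounts that turned out to be confirmed cases."""
    if flagged <= 0:
        raise ValueError("flagged must be positive")
    return confirmed / flagged


def false_positive_count(flagged, precision_rate):
    """Number of flagged accounts that were not confirmed cases."""
    return round(flagged * (1 - precision_rate))


# Hypothetical volume, chosen only to illustrate the one-in-five ratio.
flagged_accounts = 1000
confirmed_cases = 200  # one in five, as quoted in the post

rate = precision(confirmed_cases, flagged_accounts)
print(f"precision: {rate:.0%}")
print(f"accounts seized without prosecution: {false_positive_count(flagged_accounts, rate)}")
```

Under these assumptions, four out of every five seizures would hit an account that was never prosecuted, which is why the cost of each False Positive matters so much.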

The Big Divide: Business Analytics

One recurring challenge that always comes back to me is the valuation of modeling work. This is a critical step that is seldom done properly, which casts doubt on the value of the work for lack of dollar visibility. I will try to tackle it in this article. A few weeks ago, I overheard this discussion about a propensity model [think about a bank that wants to know whether a customer will accept an offer or not]: - Business Leader: OK, tell me how good your model is ... - Modeler: Pretty good, my ROC performance is .82 - Business Leader: What does that mean? - Modeler: This is the area under the ROC curve; it means that the model is good at discriminating the prospects that will answer from the prospects that will NOT answer - Business Leader (impatient): I get that, but can you translate it into dollars? - Modeler: That depends on many parameters such as the threshold, the size of the scoring set ... This is quite revealing of the current Big Divide in the Analytics
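One way to answer the "translate it into dollars" question is to fix a score threshold and compute the profit of contacting every prospect above it. The following is a minimal sketch, not the method from the post: the function names, the toy scores, the revenue of $100 per response, and the cost of $10 per contact are all hypothetical values chosen for illustration.

```python
def campaign_profit(scores, responded, threshold,
                    revenue_per_response, cost_per_contact):
    """Profit from contacting every prospect whose score is >= threshold."""
    if len(scores) != len(responded):
        raise ValueError("scores and responded must have the same length")
    # Outcomes (1 = accepted the offer, 0 = did not) of the contacted prospects.
    contacted = [r for s, r in zip(scores, responded) if s >= threshold]
    return sum(contacted) * revenue_per_response - len(contacted) * cost_per_contact


def best_threshold(scores, responded, revenue_per_response, cost_per_contact):
    """Score cut-off, among the observed scores, that maximizes profit."""
    return max(
        sorted(set(scores)),
        key=lambda t: campaign_profit(scores, responded, t,
                                      revenue_per_response, cost_per_contact),
    )


# Hypothetical hold-out set: model scores and observed responses.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
responded = [1, 1, 0, 1, 0, 0]

t = best_threshold(scores, responded, revenue_per_response=100, cost_per_contact=10)
profit = campaign_profit(scores, responded, t, 100, 10)
print(f"best threshold: {t}, profit: ${profit}")
```

The same AUC can lead to very different dollar figures depending on the threshold, the revenue per response, and the cost per contact, which is exactly the dependency the modeler mentions.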

Big Data: The Myth debunked

Nearly everyone is talking about Big Data, and what used to be a Silicon Valley crusade has become an international quest to find value within corporations' data (through the use of what's called Data Science). The truth is that we're still very far from being there. 1. The most widely used software does not scale very well. The lingua franca of Data Science is really R, with an impressive 5,873 packages (you can find pretty much anything, such as web scraping, Natural Language Processing, visualization, etc.). People like its ease of use and its natural syntax for non-programmers, as the exponential growth in the number of packages has proved. R is fabulous, but by no means was it created for large-scale processing. As a matter of fact, all the data needs to fit in memory, which severely limits its use for large-scale applications. Its biggest contender, the up-and-coming Python, which has experienced a great scientific contribution in the recent years (S