Wednesday, April 19, 2017

Scala: The Bridge Language

The Fragmented World of Languages

Data Science has changed along every dimension: applications, algorithms and techniques, software and, of course, languages. Languages tell a fascinating story because they reflect the nature and state of mind of their practitioners. Not surprisingly, they have changed a lot over the years.

When I first got interested in Data Science, SAS and Matlab were still sure bets. A few years in (2013 or so), R became the lingua franca around me: easy to write and understand, its vectorized calculations backed by the DataFrame API made it very practical for non-CS practitioners (read: statisticians and engineering generalists like myself). It did away with many lower-level considerations and offered a simpler interface, predictably at the expense of the CS crowd, who loathed such abstractions.

Today, I think we are at another junction where the ball is moving in the opposite direction: the languages that were catering to the Computer Science community are becoming increasingly easy to use, the most prominent example being Martin Odersky's Scala. Let's dive in.


Scala: the Bridge Language

My personal experience with Scala goes back to 2015, when I wanted to better understand Apache Spark. The language is touted as a user-friendly alternative to Java (both run on the JVM) with many additional features such as macros and Functional Programming constructs. Right from the get-go, the language strikes you with its ease of use (the REPL really helps) and its clean syntax.

Scala is unique because it can fulfill the needs of both Programmers and Data Scientists: it sits at the junction of the 2 worlds and is, in that sense, what I call a "Bridge Language". Smart companies, e.g. LinkedIn, Square and Spotify to name a few, have been quick to understand its benefits: more productivity, more collaboration and more readability. All of a sudden, the code did not need rewriting for production use!

What Scala can do for you:

Scala can use any Java library, which makes it easy to leverage all the work that has taken place in the Java ecosystem over the last 2 decades. If you don't know Java, Scala is a good entry point because it is more user friendly. If you know Java, Scala is worth learning to become more productive.

Besides, additional Scala libraries bring top notch code to you:

Data Science: 
  • Apache Spark: The ultimate tool for Big Data processing, with MLlib for Data Science
  • Apache Kafka: The go-to tool for stream distribution and processing
  • Akka: Actor-based concurrency (quite low level but easy to use)
Web Dev:
  • Play Framework: Modern Web Applications for everybody 
  • Slick: Functional connector to databases
Those libraries are very well documented. You can usually get started in minutes even if you're not an expert in the field.

How to get started:

I think getting your feet wet is the biggest hurdle - mostly psychological but there are also some technical difficulties that can be hard to overcome if you don't follow the right sequence. Here is my recommended guide to nail it:

  1. I would start with Martin Odersky's Coursera class. It is a great intro to Functional Programming (which you will encounter often in Scala) and it also features great material about installation. Try to do the whole specialization if you have time. I would also encourage you to use IntelliJ as your IDE, a better alternative to Eclipse.
  2. If you're a Data Scientist, try to learn more about Data Structures using Scala. I followed the Data Structures and Algorithms specialization on Coursera, a $420 investment you won't regret. If you are a Software Engineer, chances are you already have this knowledge (but it could be a good refresher).
  3. Choose a side project to hone your skills. Here are a few ideas:
    1. Create a web service with Play
    2. Put together a Data Pipeline with Spark
    3. Replace a small Java App with Scala (or code your next Java App with Scala)
Good luck!

Have experience using Scala? Let me know what you think!

Thursday, October 8, 2015

The future of Purchase Attribution

The Online Ad industry is alive and well. A specialist firm reports double-digit growth, with Total Display expected to amount to over $90B worldwide by 2017, a roaring 18% CAGR since 2014.

With this influx of investment, Marketers will expect to see improvements in measurement methodologies, and we are already seeing signs of it in this area.

The perennial problem of Purchase Attribution

Let's take an example to illustrate this famous problem: Betty is an avid internet user. As she is showing signs of a future Apparel purchase, H&M buys her address and sends her a promotional email on Day 0. Two days later, she is targeted a second time on her mobile while doing a search on Google. Finally, on Day 4, she connects to Facebook and clicks one of those Carousel ads, leading to a purchase on H&M's site.

Who should get the credit for the purchase? Facebook?

The Past: Last click

The simplest (and now outdated) attribution methodology was to give all credit to the last ad, hence its name, "Last Click". While simple and easily understandable by everyone, it has the flaw of not giving any credit to the first 2 interactions, although they might have been necessary to initiate Betty's interest in H&M.

The present: Time Decay

A few years ago, Google Analytics introduced several alternative attribution methodologies, including Time Decay. This method gives more credit to more recent events. Consequently, the Paid Email and the Mobile Ad get respectively $20 and $30 of credit in this illustrated example. A good step forward!
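A time-decay split is straightforward to sketch with an exponential decay weight (the 7-day half-life below is a hypothetical parameter, not Google Analytics' actual setting, and the channel names are illustrative):

```python
def time_decay_credits(touchpoints, total_value, half_life_days=7.0):
    """Split total_value across touchpoints, weighting recent ones more.

    touchpoints: list of (channel, days_before_conversion) pairs.
    A touchpoint's weight halves for every half_life_days before conversion.
    """
    weights = {ch: 0.5 ** (days / half_life_days) for ch, days in touchpoints}
    total = sum(weights.values())
    return {ch: total_value * w / total for ch, w in weights.items()}

# Betty's journey: email 4 days before purchase, mobile ad 2 days before,
# Facebook ad on the day of purchase
credits = time_decay_credits(
    [("paid_email", 4), ("mobile", 2), ("facebook", 0)], total_value=100.0)
```

With these weights, the most recent touchpoint (Facebook) gets the largest share of the $100 and the email the smallest, which is exactly the Time Decay intuition.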

That said, the fundamental problem remains: What would have happened had the Marketer not bought the FB ad? How about not buying the mobile ad? Or the Paid email? All in all, what would be the best combination of marketing stimuli here?

If the Time Decay methodology holds true, it implies that the purchase would not have been made in 50% of the cases had the Marketer not bought the FB ad. Is that accurate? Can we find a better way to estimate the true effect of each interaction? This is where Uplift comes into play.

The Future: Uplift

What happened in Direct Mail, where more and more companies use Control Groups to measure the true incremental effect, will undoubtedly roll into the Online Ad space. This is an exciting time!

We will start seeing systematic control groups on ads. For each interaction on the customer journey, the marketer will be able to assess the true effect of each ad, that is, the delta between the group that was exposed to the ad and the group that was not (or was exposed to a competitor's ad). This is very powerful, as it will provide a view on effects as opposed to pure correlations. Plus, it is an accurate and scientific approach, as opposed to a highly subjective methodology relying on parameters, e.g. the time decay coefficient, that are set arbitrarily.

What it means for you

Uplift Measurement is a game changer. It will force ad providers to change their mindset from "being on the customer journey" to "altering the customer journey". It will certainly make ads more relevant and content more targeted. I also personally believe that behemoths like Facebook will face more pressure from Marketers to deliver results.

In addition, one may argue that it could decrease the "perceived value of ads". In the chart above, the amount to split between ads used to be $100 but is now only $10 (the remaining $90 represents the demand that would have materialized even if no ad had been displayed). Therefore, moving to an incremental approach could hamper online investments and force the Online Ad industry to focus more on results.

Some Agencies have started working on this topic. I recently attended a pitch by Numberly and they are going right in that direction. The game is on!

Tuesday, July 7, 2015

Uplift vs. Response: The Targeting Dilemma

Targeting is always reinventing itself. Who should be targeted for ads display? What should be the target customers for a direct mailing campaign or an email blast? This has been quite puzzling to Marketers in recent years. Software providers and Marketing Agencies have been quick to seize this opportunity to offer a solution to it: Uplift Modeling. Let's deep dive into what this new realm exactly means.

1- What is Uplift:

Let's look at 2 groups of customers to get familiar with the concept. 

A pure response model would dictate to target the segment with the highest Response Rate. Now, is it the right approach?

One may argue that maximum effectiveness, or Uplift, is a better criterion, since 5 - 2 = 3 more customers would convert for every 100 customers targeted.

So, Uplift is really the art of maximizing efficiency by subtracting out the "base response". A corollary is to take out customers who are inelastic: either they cannot be convinced to take the offer (the lost causes) or they will shop anyway (the die-hards).
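In code, the concept fits in one line; here is a hedged sketch with made-up segment numbers matching the 5% vs. 2% example above:

```python
def uplift(treated_rate, control_rate):
    """Uplift = response rate when targeted minus the 'base' response rate."""
    return treated_rate - control_rate

# Segment A: 5% respond when targeted, 2% would have bought anyway (control)
# Segment B: 7% respond when targeted, but 6% would have bought anyway
uplift_a = uplift(0.05, 0.02)  # 3 extra buyers per 100 customers targeted
uplift_b = uplift(0.07, 0.06)  # higher response rate, but only 1 extra buyer
```

A pure response model would pick Segment B (7% > 5%); an uplift model correctly prefers Segment A, where the marketing actually changes behavior.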

2- What is required to model Uplift?

a. Raise Awareness:

The first requirement is to realize that Response is not a good KPI. In my experience, it is sometimes difficult to convince management that their Marketing Department has been using the wrong KPI for so long. Plus, switching from a measurable to a latent metric can be quite daunting! Can Uplift be measured at all?

b. Be Proactive at measuring Uplift:

With management support, the team can start creating a performance report. What is the actual share of incremental Demand out of total Response? This is the realm of A/B testing, a well-studied topic. A typical "gotcha" is the size of the Control Group. Too small a control group means that the measurement may be misleading due to uncertainty (see Confidence Intervals), so one should be generous with volume in the learning stage to ensure proper measurement. There are tools out there to calculate the right size.
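For illustration, the standard two-proportion power calculation gives a starting point for sizing the groups (a sketch with z-values hard-coded for a 5% two-sided significance level and 80% power; dedicated calculators will be more precise):

```python
from math import ceil, sqrt

def min_group_size(p_control, min_detectable_lift):
    """Approximate per-group size to detect min_detectable_lift over p_control.

    Uses the classic two-proportion formula with z = 1.96 (alpha = 0.05,
    two-sided) and z = 0.84 (80% power).
    """
    z_alpha, z_beta = 1.96, 0.84
    p_treat = p_control + min_detectable_lift
    p_bar = (p_control + p_treat) / 2.0
    n = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
         + z_beta * sqrt(p_control * (1 - p_control)
                         + p_treat * (1 - p_treat))) ** 2 / min_detectable_lift ** 2
    return ceil(n)

# Detecting a 1-point lift over a 2% base response rate:
n = min_group_size(0.02, 0.01)  # on the order of 3,800 customers per group
```

The takeaway: at low response rates, even a 1-point lift requires thousands of customers per group, which is why skimping on the control group volume backfires.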

c. Model Uplift:

Now that the team has a good handle on the current performance of the selection process, it is time to improve it using a modeling approach.

The model needs data from a randomized experiment, that is, a Marketing Treatment sent at random. As you may expect, this data can be expensive to generate, but it will unlock insights that could not be identified otherwise.

The objective of the model is to segregate customer groups based on their Uplift. For instance, the tree below shows 4 distinct groups with Uplift ranging from 0% to 10%. A smart Marketer would first target the best group at 10% uplift, then work her way down to the second leaf at 8%, etc.

3- What options are available out there?

Now that you are all excited about Uplift Modeling, it's time to take the first step. I know of only 2 options on the market using a mainstream modeling environment. Truth is, the offering is quite limited.

  • SAS Incremental Response Modeling (SAS Enterprise): http://support.sas.com/resources/papers/proceedings13/096-2013.pdf
  • uplift package for R (Leo Guelman): http://cran.r-project.org/web/packages/uplift/index.html

Then, there are plenty of vendors/agencies that do Uplift modeling, like Portrait Software (product) and the regular Marketing Agencies / Consultancies.

4- Further Reading

To explore more on the topic, here are a few articles:
  • An "OK" introductory article: http://www.dummies.com/how-to/content/basics-of-uplift-predictive-analytics-models.html
  • A good research article http://stochasticsolutions.com/pdf/sig-based-up-trees.pdf
  • Another good article (by the author of the Uplift package in R): https://ideas.repec.org/p/bak/wpaper/201406.html 

Monday, May 25, 2015

Saving $12K on modeling jobs in AWS

Your AWS bill is going through the roof and you don't know what steps to take? Although many opinion leaders have voiced the somewhat flawed statement that "AWS is cheap", the reality is that one needs to carefully monitor costs now that getting extra resources is as simple as pushing a button.

AWS is still expensive:

AWS has lowered its prices repeatedly (42 times as of last year, according to AWS) in a race to the bottom with Google Cloud. That said, it can still cost up to $17K a year to run a large on-demand instance, as the graph below shows:

Modeling jobs are "peak demand":

Modeling jobs need lots of resources for a short amount of time, days at the most. Consequently, they should be treated as peak demand, which makes them prime candidates for Elastic Computing. Plus, a job can easily be bundled as a "Data + Instructions" package, making it easy to offload to another server.

Starcluster to the rescue:

StarCluster is an open-source utility developed at MIT. It was first built for resource-hungry MIT students running simulation jobs but has since migrated to the Cloud, AWS in particular.

In practice, StarCluster allows the user to spin up 100%-ready clusters on demand. The installation of required packages (password-less SSH and a Network File System on the infrastructure side; OpenMPI and OpenBLAS on the distributed computing side) is handled seamlessly by StarCluster, so that the user can focus on high-value tasks. Only 15-20 min are needed to get a cluster of any size ready to crunch data!

Besides, the organization can put power back into the users' hands by letting them set up their cluster on demand through a simple configuration file provided by StarCluster:
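For reference, a minimal configuration might look like the sketch below. The section and key names follow the StarCluster documentation, but every value (keys, AMI, instance type) is a placeholder to adapt to your own account:

```ini
[aws info]
AWS_ACCESS_KEY_ID = <your-access-key>
AWS_SECRET_ACCESS_KEY = <your-secret-key>

[key mykey]
KEY_LOCATION = ~/.ssh/mykey.rsa

[cluster modelingcluster]
KEYNAME = mykey
CLUSTER_SIZE = 4
NODE_IMAGE_ID = ami-XXXXXXXX
NODE_INSTANCE_TYPE = c3.xlarge
```

With this file saved as ~/.starcluster/config, `starcluster start modelingcluster` spins the cluster up and `starcluster terminate modelingcluster` shuts it down (and stops the billing).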


All things considered, the winning formula would be to down-scale the current infrastructure and transfer the big modeling jobs to StarCluster. For instance, instead of running a 4xlarge, it would make sense to keep only 1 X-Large reserved instance for day-to-day operations and offload the rest to a StarCluster setup. In the process, you would save up to $12K in expenditures.

Also noteworthy: it comes with Python pre-installed (if you're more of a Python fan).

Wednesday, March 18, 2015

What I like about Tableau 9.0 ...

Tableau is about to get a major overhaul and it looks very good. I was invited to the Beta and the final product pleasantly surprised me by its quality. It looks neat and sleek. But, what's new in the box? I will go through the improvements I liked.

1- The improved Look and Feel:

Right from the start, you can feel that the UX team has done some good work. As soon as you open it, you get that warm feeling. As a bonus, you can now save your datasource (which saves a good deal of time when starting a new project) and it will appear at the bottom left of the Welcome screen.


A major drawback of previous releases was the clutter that wide datasets would generate with 200+ data fields, in addition to the calculated fields. As of 9.0, Tableau introduces folders, a neat way to organize your data:

Very useful! 

Improved Calculated fields:
Now integrating auto-suggest (finally!), calculated fields are much easier to create and manage.

More responsive Tooltip:

The days when you had to wait 3 seconds for the tooltip to pop up are gone. Now, it is displayed swiftly, provides better interaction and rivals D3.js in terms of interactivity.

2- Data Wrangling made easier:

We have all had this moment when we forgot to split a field and ended up creating 2 calculated fields because we did not want to regenerate the table. Tableau now integrates this simple yet useful functionality in the data source view. Under the hood, the instructions are passed through the driver and translated into SQL, which means the loss in performance is minimal.

3- Analytics Pane - A quick access to simple Analytics:

Nothing really new here per se, but I found the Analytics quite hard to reach in previous releases. Now, they are centralized in 1 pane and can be dragged and dropped.

4- Other improvements:

  • Extended Support for statistical files: You can now read R files (.RData) and SPSS files (.sav)
  • Connector to SparkSQL: It was released in 8.3 (although not visible to everyone) and is now accessible to everyone. It makes sense, as Databricks just released Spark 1.3.0 with major upgrades to SparkSQL.
  • Map: Geography Lookup: You can now look up cities, states or countries on a map and Tableau will zoom in for you.
  • Parallel SQL Queries: I think this feature has pros and cons. For MPP/Hadoop databases, all requests are already processed in parallel, so it will not be helpful. For mature SQL databases (e.g. Oracle 11, SQL Server or PostgreSQL), it makes sense when most of the time is spent on the database side (which is rarely the case in my experience)
  • Performance enhancements: Although I have not tested it, it seems that the team spent time on CPU-level optimization (Data Engine Vectorization) and Parallel Aggregation

5-What is still missing:

  • Dynamic Parameters: I was honestly expecting this to be part of the release. Parameters are static lists by construction and you can't use them as action filters. The Tableau Community has been pushing for it but it seems it was delayed. Let's hope it will be part of the next shipment!
  • Easier Excel-like Conditional Formatting: Another popular community request. There are work-arounds but it would enhance the tables greatly and enable score-cards, a popular reporting tool. 


Tableau has put some good work into this release, especially in making the front-end easier to use and more intuitive. I think it's part of Tableau's strategy to go after more business users, where the market is. This has proven to be the right move for them so far.

Sunday, February 22, 2015

Why Spark is a game changer in Big Data and Data Science

The second part of this post is technical and requires an understanding of classification algorithms.

The Business Part:

Hadoop was born in 2004 and conquered the world of top-end software engineering in the Valley. Used primarily to process logs, it had to be reinvented in order to attract business users.
In 2014, I noticed a shift in the industry with the ascent of "Hadoop 2.0": more approachable for business users (e.g. Cloudera Impala) and faster, it is bound to overtake Hadoop as we know it. Spark has been at the forefront of this revolution and provides a general-purpose Big Data environment.
Spark comes with a strong value proposition:
  • It's fast (10-100X vs. Map Reduce)
  • It's scalable (I would venture in saying that it can scale for 99.99% of the companies out there)
  • It's integrated (for instance, it's possible to run SQL / ML algorithms in the same script)
  • It's flexible
Spark's Components- (Credit: spark.apache.org)

Flexibility is a key differentiator for me. SQL is an excellent language for data management but is quite restricted and does not allow complex aggregation queries. Spark does. Plus, Spark has 3 language APIs (Java, Scala, Python) to cater to a larger audience. It seems the lesson was learnt from Hadoop 1.0, which only appealed to Software Engineers with an excellent command of Java.

Because of all of the above, I see Spark as "Big Data for the masses". Data Scientists can use this tool effectively.

Now, let's get our hands dirty to demonstrate the flexibility point. We will be using Spark Python API.

The Technical Part:

Let's assume we have a big classification model trained with Spark MLlib. Unfortunately, some performance measurements are not implemented, so we need to figure out a way to compute the ROC Curve efficiently using all of our 100M+ datapoints. This rules out scikit-learn or any local operation.

ROC Curve  Credit: Wikipedia

Step A: Identify the Inputs:

The performance will be measured on the training set. The data can be shaped into 2 columns: the actual label and the model prediction (between 0 and 1):

labelsAndPreds = testing.map(lambda p: (p.label, model.predict(p.features)))

Note: map applies the function to each individual row of your dataset (called an RDD in Spark).

This is how it looks like:

In order to draw this curve, you will also need the points on it (aka the Operating Thresholds). These are the sensitivity levels of the classification algorithm, from 0 (it would flag everyone) to 1 (no one). Let's generate this list and send a copy to the slaves ("broadcast" in Spark lingo):

operating_threshold_bc = sc.broadcast(np.arange(0,1,.001))

Here is the strategy moving forward:

Step B: Define the class “ROC_Point” and its behavior:

The ROC_Point represents 1 point on the Receiver Curve. We will create a Python class to store the important information necessary to locate the ROC Point.

The code looks like that:

The class has 2 methods:

  1. The initialization that we will use in a map statement: For instance, ROC_Point(True,True) will return an instance of the class with 1 true positive, the 3 other being 0.
  2. The aggregation (“reducer”) that can return 1 ROC_Point from 2 ROC_Points. Basically, a way to tally your KPIs during the aggregation
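Based on that description, the class can be sketched as follows (a reconstruction along the lines of the screenshotted code, not the original):

```python
class ROC_Point(object):
    """Confusion-matrix tallies (TP, FP, FN, TN) at one operating threshold."""

    def __init__(self, actual, flagged):
        # e.g. ROC_Point(True, True) -> 1 true positive, the 3 other tallies at 0
        self.TP = int(actual and flagged)
        self.FP = int(flagged and not actual)
        self.FN = int(actual and not flagged)
        self.TN = int(not actual and not flagged)

    def ROC_add(self, other):
        """The 'reducer': returns 1 ROC_Point tallying the KPIs of 2 ROC_Points."""
        combined = ROC_Point(False, False)
        combined.TP = self.TP + other.TP
        combined.FP = self.FP + other.FP
        combined.FN = self.FN + other.FN
        combined.TN = self.TN + other.TN  # overwrites the constructor's tally
        return combined
```

For example, `ROC_Point(True, True).ROC_add(ROC_Point(False, True))` yields a point with 1 true positive and 1 false positive.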

Step C: Apply the logic:

1- Map Job:

For any operating threshold, we initialize a ROC_Point this way:


Looping over the list of operating thresholds, we get to:

labelsAndPreds_Points = labelsAndPreds.map(
    lambda lp: [ROC_Point(lp[0] == 1, lp[1] > threshold)
                for threshold in operating_threshold_bc.value])

Let's look at the results:


We can see that the ROC_Point has properly saved the information and we're now ready for the aggregation.

2- Reduce Job

Now we will utilize the reducer we created. For each threshold, we add the 2 ROC_Points together, summing their tallies of FP, TP, FN and TN:

labelsAndPreds_ROC_reduced = labelsAndPreds_Points.reduce(
    lambda l1, l2: [ROC_1.ROC_add(ROC_2) for ROC_1, ROC_2 in zip(l1, l2)])

Let’s look at the results:

This is when the algorithm flags everyone:

At the other extreme, this is when the algorithm does not flag anyone:

In 2 lines of code, we managed to calculate the ROC curve. Quite succinct, a real improvement over Hadoop 1.0! 

3- ROC Curve

Now we need to draw the curve (the data is now sitting on your driver node and this will run locally):
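Extracting the (false positive rate, true positive rate) pairs from the reduced tallies is then simple local Python (the tally numbers below are hypothetical; plot the resulting pairs with matplotlib or any charting library):

```python
def to_rates(tp, fp, fn, tn):
    """Convert one threshold's tallies into (false positive rate, true positive rate)."""
    return fp / float(fp + tn), tp / float(tp + fn)

# Tallies from the reduce job (hypothetical numbers), ordered from the
# most aggressive threshold (flag everyone) to the least (flag no one).
tallies = [(100, 100, 0, 0), (70, 15, 30, 85), (0, 0, 100, 100)]
curve = [to_rates(*t) for t in tallies]
# The curve runs from (1.0, 1.0) (flag everyone) down to (0.0, 0.0) (flag no one)
```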

Although still in active development, with a lot of changes and improvements to come, Spark is full of potential: it brings enormous flexibility to get to the insight hidden in your data better and faster. I recommend keeping an eye on this up-and-coming tool. Specifically, I think Spark will be a game changer for Data Science and could encroach on R in the coming years. Note that MLlib attempts to keep its syntax consistent with scikit-learn, which makes the transition easy for Python fans. What would be out of reach in R, or would have taken hours to compute, is now within reach with Spark.

Saturday, February 14, 2015

Marketing and the rule of the 1% improvement

Do you happen to know this man?

Credit: Wikipedia

His name is Dave Brailsford, a coach rendered famous for overhauling the British national Cycling Team and, in only 6 years, bringing his team to victory at the 2004 Olympic Games in Athens, starting a tidal wave that swept the 2 subsequent Olympic gatherings. Coaching Bradley Wiggins, the first Briton to win the Tour de France, in 2012 made a legend of the man.

What is the magic sauce behind this success story? His secret was to take a holistic approach (nutrition, training, lifestyle, even psychology), break down the training process into elementary steps and seek small incremental improvements in each, expecting them to snowball.

Quite naturally, one can apply the same concept to Marketing by breaking the Customer Lifecycle into small "bits" and looking at incremental improvement at each stage. 

To illustrate the point:

Simplified Customer Lifecyle

Each customer evolves over time. Let's look at a transition matrix with representative numbers of a medium-sized retailer:

Each line sums to 100%. As an example, this matrix says that out of 100 Loyal customers, 80 remain Loyal and 20 become At-risk.

Now, let's do a projection to future dates using this transition matrix. To get a projection 2 months out, we apply the Transition Matrix twice to the initial population; 3 times for 3 months, etc.

We do it twice:
- Under "normal conditions"
- Under "Brailsford conditions", which boost the transition matrix by applying a little 1% incremental improvement to each transition.
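The projection can be sketched with NumPy (the matrix and population below are illustrative placeholders, not the exact figures behind the chart):

```python
import numpy as np

# States: Loyal, At-risk, Lost -- each row of the transition matrix sums to 1,
# e.g. 80% of Loyal customers stay Loyal and 20% become At-risk.
T = np.array([[0.80, 0.20, 0.00],
              [0.30, 0.50, 0.20],
              [0.05, 0.15, 0.80]])

pop = np.array([100.0, 50.0, 50.0])  # initial customer counts per state

# "Normal conditions": apply the matrix once per month
month_3 = pop @ np.linalg.matrix_power(T, 3)

# "Brailsford conditions": nudge each favorable transition up by 1 point
# (rows still sum to 1, since every gain is offset by a matching loss)
T_boost = T + np.array([[0.01, -0.01,  0.00],
                        [0.01,  0.00, -0.01],
                        [0.01,  0.00, -0.01]])
month_3_boost = pop @ np.linalg.matrix_power(T_boost, 3)
```

Multiplying each projected population by an average monthly spend per state then yields the Demand comparison described below.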

Here, the Brailsford-inspired improvement is twofold: getting more customers to Loyal as well as getting fewer At-risk or even Lost.

The result is quite stunning. With our hypotheses, we get $560M of monthly demand vs. $467M, a 20% improvement! You will also notice that the company is much healthier by having a large share of its customer base as Loyal.

The bottom-line is that companies can look at quite significant results by launching multiple small initiatives across the Customer Lifecycle with intent of creating this snowball effect.

For those who want to check the code and rerun the analysis with their own numbers, it's here