Ribbon

Sunday, June 16, 2019

5 rules for a productive Science team


Data Science is a new discipline. As such, companies (incl. Tech) are still trying to figure out the optimal configuration for the team. However, there are a few guiding principles that are important to follow so that your Science team does not fall apart. I have compiled my own in this blog article and I'm sharing them with you to generate a discussion:




1- Have a single source of truth

During my consulting years, I lost count of how many clients had challenges that emerged largely from a single problem: teams looking at different data.

Aligning on metrics and methodology is very important as it forces the team(s) into a "single view of the world". Without that agreement, progress is hampered by conflicting definitions and by confusion when metrics disagree.

In practice, this can be achieved through a single "fact table", adequate documentation (with thorough definitions and pointers to code) and availability of metrics where they matter (for instance, on your experimentation platform), so that nobody is tempted to create their own metrics!
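
As a purely hypothetical illustration, even a lightweight metric registry can act as that single source of truth: one agreed definition per metric, with a pointer to the canonical code. The table names, file paths and owners below are made up.

```python
# Hypothetical metric registry: one agreed-upon definition per metric,
# pointing back to the single fact table and the canonical query.
METRICS = {
    "weekly_active_users": {
        "definition": "Distinct users with >= 1 session in the trailing 7 days",
        "source_table": "analytics.fact_sessions",    # the single "fact table"
        "code": "metrics/weekly_active_users.sql",     # pointer to the canonical code
        "owner": "growth-science",
    },
    "conversion_rate": {
        "definition": "Orders / sessions over the same trailing 7-day window",
        "source_table": "analytics.fact_sessions",
        "code": "metrics/conversion_rate.sql",
        "owner": "growth-science",
    },
}

def describe(metric_name: str) -> str:
    """Return the agreed definition so dashboards and experiments quote the same text."""
    m = METRICS[metric_name]
    return f"{metric_name}: {m['definition']} (code: {m['code']})"

print(describe("weekly_active_users"))
```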

2- Prioritize Iteration speed 

Iteration speed eats theoretical work for breakfast. 

At tech companies, systems are big and complicated. As a result, it may be tempting to favor theoretical work and sacrifice iterations.

Instead, I would advocate that you find a way to simplify the problem to the point where an iteration can be done rapidly (aim for an order of magnitude of 30-60 min). On the data side, downsampling is a common technique; looking at a single use case is another. On the code side, carving out the codebase so that it can be run in an interactive notebook (e.g. Jupyter) works too. Scaling to larger data / scope will come when your solution is ready!
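
For instance, here is a minimal pandas sketch of the downsampling idea; the file names, the column and the 1% fraction are all hypothetical:

```python
import pandas as pd

# Work on a small, reproducible slice so one iteration fits in the 30-60 min budget.
events = pd.read_parquet("events.parquet")                # hypothetical dataset

dev_sample = events.sample(frac=0.01, random_state=42)    # option 1: random 1% downsample
dev_single_case = events[events["country"] == "FR"]       # option 2: carve out a single use case

# Iterate on the small file; scale back up only once the solution works.
dev_sample.to_parquet("events_dev.parquet")
```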

3- Observability is everything

Lord Kelvin lived some 200 years ago, but a principle often attributed to him still applies well today: "If you can't measure it, you can't improve it".

By defining your metrics properly, you are setting the direction of the team's effort. Further, by measuring everything, you can track progress toward your target and enable the team to diagnose problems along the way.





4- Complexity is bad

Tech and Consulting have the resources to hire talented employees. This is generally good: talented employees love to work with equally talented employees and build great products!

However, there is one downside: talented employees love to build complex stuff! Complexity is often overlooked when the gain is palpable (or when the improvement is "fancy" - for instance, using the latest Neural Net config). Proper diligence on how much burden this adds to iteration cycle time, debugging, and reproducibility is key to keeping the systems lean and the team moving.

5- Naming conventions are surprisingly important

Surely, naming variables is no biggie, right? You couldn't be further from the truth! Bad variable names make the code harder to read and, worse, introduce bugs! A few rules I stick to (with a short sketch after the list):
  • Always specify units of measure: weight_kg is much better than weight
  • Be specific! Don't name your variables var or tmp (really, don't)
  • Indicating the type with a suffix doesn't hurt for booleans: for instance is_valid_ind or is_valid_flag
  • Stick to your team's conventions. For instance, 1/0 can be coded either with a boolean or with a tinyint
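
Here is a quick sketch of these rules in Python; the variables and the pricing formula are made up for illustration:

```python
# Hard to read, and an easy source of bugs:
tmp = 72.5
var = 3

# Clearer: units in the name, specific nouns, a boolean suffix per team convention.
weight_kg = 72.5
package_count = 3
is_valid_flag = True   # or is_valid_ind -- just pick one convention and stick to it

# Made-up pricing formula; explicit units make "kg vs lb" mistakes visible.
shipping_cost_usd = 4.0 * weight_kg + 1.5 * package_count
```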

What are your rules? Let's have a discussion!



Monday, June 4, 2018

Focus: Shapley value



Game theory is a fascinating topic codifying and quantifying all sorts of interactions between stakeholders in a game. The most popular setup is the prisoner's dilemma but there is much more to it. Today, we will cover the Shapley value as I recently stumbled across this original yet relatively unknown concept.

Problem at stake:


Example 1: In a salary negotiation, the employee showcases their skills and what they bring to the company. But how much are these skills worth?
Example 2: In a Joint Venture, each founding company brings expertise. What’s a fair distribution of the ownership/shares?
Example 3: 2 (or 3) TelCo companies want to build a fiber network that would benefit all of them. What’s a fair payment breakdown?

When you think about these problems, most of us tend to answer them with guesses: “You should ask for an X% raise because you deserve it”. But there is actually a theory for it.

Let’s look at the theory behind it.


Solution:

This is known as a cooperative game, and there is only one breakdown function that fulfills a few desirable conditions (more on this later).

The gist of it: “Player A’s fair reward is the average of their marginal contributions to the different coalitions leading to the final setup”, where:

For a game with 3 players (A, B, C), we denote:
  • Final setup: the final set of stakeholders S = {A, B, C}
  • Coalition: a subset of S
  • Marginal contribution: adding A to {B, C} is worth Value{A,B,C} – Value{B,C}

Now, let’s introduce values for the example. For instance:
  • V(A) = 12
  • V(B) = 8
  • V(C) = 2
  • V(A,B) = 22
  • V(A,C) = 15
  • V(B,C) = 11
  • V(A,B,C) = 23






The Shapley value of A is the average of A's marginal contributions over the 3! = 6 possible arrival orders:

  • A arrives first (orders A-B-C and A-C-B): V(A) = 12, counted twice
  • A joins B (order B-A-C): V(A,B) – V(B) = 22 – 8 = 14
  • A joins C (order C-A-B): V(A,C) – V(C) = 15 – 2 = 13
  • A arrives last (orders B-C-A and C-B-A): V(A,B,C) – V(B,C) = 23 – 11 = 12, counted twice

Shapley(A) = (12 + 12 + 14 + 13 + 12 + 12) / 6 = 75 / 6 = 12.5
Generalization

From there, you can intuit the general formula, where n! is the total number of permutations of the n players:

Shapley(A) = Σ over coalitions K ⊆ S with A ∈ K of [ (|K| – 1)! · (n – |K|)! / n! ] × [ V(K) – V(K \ A) ]

where K \ A denotes the coalition K without A.
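
For readers who prefer code, here is a minimal Python sketch of the permutation-based definition, applied to the example above. The coalition values are taken straight from the post; treating the empty coalition as worth 0 is an assumption:

```python
from itertools import permutations

v = {
    frozenset(): 0,    # assumption: an empty coalition is worth nothing
    frozenset("A"): 12, frozenset("B"): 8, frozenset("C"): 2,
    frozenset("AB"): 22, frozenset("AC"): 15, frozenset("BC"): 11,
    frozenset("ABC"): 23,
}
players = ["A", "B", "C"]

def shapley(player):
    """Average marginal contribution of `player` over every arrival order."""
    orders = list(permutations(players))
    total = 0.0
    for order in orders:
        joined_before = frozenset(order[: order.index(player)])
        total += v[joined_before | {player}] - v[joined_before]
    return total / len(orders)

for p in players:
    print(p, shapley(p))   # A -> 12.5, B -> 8.5, C -> 2.0, summing to V(A,B,C) = 23
```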



Wednesday, April 19, 2017

Scala: The Bridge Language



The Fragmented World of Languages

Data Science has changed along every one of its dimensions: applications, algorithms and techniques, software and, of course, languages. Languages tell a fascinating story because they reflect the nature and the state of mind of the practitioners. Not surprisingly, they have changed a lot over the years.

When I first started getting interested in Data Science, SAS and Matlab were still sure bets. A few years in (2013 or so), R became the Lingua Franca around me: easy to code and understand, its vectorized calculations backed by the DataFrame API made it very practical for non-CS practitioners (read: statisticians and engineering generalists like myself). It did away with a lot of the lower-level considerations and ended up offering a simpler interface, predictably at the expense of the CS crowd, who loathed such abstractions.

Today, I think we are at another junction where the ball is moving in the opposite direction: the languages that were catering to the Computer Scientist community are now becoming increasingly easy to use, their most prominent member being Martin Odersky's Scala. Let's dive in.


Scala: the Bridge Language

My personal experience with Scala goes back to 2015, when I wanted to better understand Apache Spark. The language is touted as a user-friendly alternative to Java (both run on the JVM) that also has lots of additional features like macros and Functional Programming constructs. Right from the get-go, the language strikes you with its ease of use (the REPL really helps) and its clean syntax.

Scala is unique because it can fulfill the needs of both Programmers and Data Scientists: it sits at the junction of the two worlds and is, in that sense, what I call a "Bridge Language". Smart companies (LinkedIn, Square and Spotify, to name a few) have been quick to understand its benefits: more productivity, more collaboration and more readability. All of a sudden, the code does not need rewriting for production use!



What Scala can do for you:

Scala can use any Java library, which makes it easy to leverage all the work that has taken place in the Java ecosystem over the last 2 decades. If you don't know Java, this is a good entry door because Scala is more user-friendly. If you do know Java, Scala is worth learning to become more productive.

Besides, additional Scala libraries bring top notch code to you:

Data Science: 
  • Apache Spark: The ultimate tool for Big Data processing, with MLlib for Data Science
  • Apache Kafka: The go-to tool for stream distribution and processing
Concurrency:
  • Akka: Actors (quite low level but easy to use)
Web Dev:
  • Play Framework: Modern Web Applications for everybody 
  • Slick: Functional connector to databases
Those libraries are very well documented. You can usually get started in minutes even if you're not an expert in the field.

How to get started:

I think getting your feet wet is the biggest hurdle - mostly psychological but there are also some technical difficulties that can be hard to overcome if you don't follow the right sequence. Here is my recommended guide to nail it:

  1. I would start with Martin Odersky's Coursera class. It is a great intro to Functional Programming (which you will encounter often in Scala) and it also features great material on installation. Try to do the whole specialization if you have time. I would encourage you to use IntelliJ as your IDE, a better alternative to Eclipse.
  2. If you're a Data Scientist, try to learn more about Data Structures using Scala. I followed the Data Structures and Algorithms specialization on Coursera, a $420 investment you won't regret. If you are a Software Engineer, chances are you already have this knowledge (but it could be a good refresher)
  3. Choose a side project to hone your skills. Here are a few ideas:
    1. Create a web service with Play
    2. Put together a Data Pipeline with Spark
    3. Replace a small Java App with Scala (or code your next Java App with Scala)
Good luck!

Have experience using Scala? Let me know what you think!


Thursday, October 8, 2015

The future of Purchase Attribution



The Online Ad industry is alive and well. A specialist firm reports double-digit growth, with Total Display expected to amount to over $90B worldwide by 2017, a roaring 18% CAGR since 2014.



With this influx of investment, Marketers will expect to see improvements in measurement methodologies, and we are already seeing signs of it in this area.

The perennial problem of Purchase Attribution

Let's take an example to illustrate this famous problem: Betty is an avid internet user. As she is showing signs of a future Apparel purchase, H&M buys her address and sends her a promotional email on Day 0. Two days later, she is targeted a second time on her mobile while doing a search on Google. Finally, on Day 4, she connects to Facebook and clicks on one of those Carousel ads, leading to a purchase on H&M's site.

Who should get the credit for the purchase? Facebook?

The Past: Last click

The simplest (and now outdated) attribution methodology was to give credit to the last ad, hence its name "Last Click". While simple and easily understandable by everyone, it has the flaw of not giving any credit to the first 2 interactions, although they might have been necessary to initiate Betty's interest in H&M.


The present: Time Decay

A few years ago, Google Analytics introduced many alternative attribution methodologies, including Time Decay. This method gives more credit to more recent events. Consequently, Paid Email and Mobile Ads get $20 and $30 of credit respectively in this illustrated example. A good step forward!
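
To make the mechanics concrete, here is a minimal sketch of an exponential time-decay split applied to Betty's journey. The half-life and the resulting credits are illustrative assumptions, not Google Analytics' exact implementation, and the arbitrary parameter is precisely the issue discussed below:

```python
# Illustrative only: a simple exponential time-decay split over Betty's three touchpoints.
touchpoints = {"paid_email": 4, "mobile_search_ad": 2, "facebook_carousel": 0}  # days before purchase
half_life_days = 2.0        # arbitrary decay parameter -- the core criticism of the method
purchase_value = 100.0

weights = {name: 2 ** (-days / half_life_days) for name, days in touchpoints.items()}
total_weight = sum(weights.values())
credits = {name: round(purchase_value * w / total_weight, 2) for name, w in weights.items()}
print(credits)   # {'paid_email': 14.29, 'mobile_search_ad': 28.57, 'facebook_carousel': 57.14}
```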

That said, the fundamental problem remains: What would have happened had the Marketer not bought the FB ad? How about not buying the mobile ad? Or the Paid email? All in all, what would be the best combination of marketing stimuli here?

If the Time Decay methodology holds true, it implies that the purchase would not have been made in 50% of the cases had the Marketer not bought the FB ad. Is that accurate? Can we find a better way to estimate the true effect of each interaction? This is where Uplift comes into play.

The Future: Uplift

What happened to Direct Mail, where more and more companies use Control Groups to measure the true incremental effect, will undoubtedly roll into the Online Ad space in the future. This is an exciting time!

We will start seeing systematic control groups on ads. For each interaction on the customer journey, the marketer will be able to assess the true effect of each ad, that is, the delta between the group that was exposed to the ad and the group that was not (or was exposed to a competitor's ad). This is very powerful, as it provides a view on effects as opposed to pure correlations. Plus, it is an accurate and scientific approach, as opposed to a highly subjective methodology relying on parameters (e.g. the time decay coefficient) that are set arbitrarily.

What it means for you

Uplift Measurement is a game changer. It will force ad providers to change their mindset from "being on the customer journey" to "altering the customer journey". It will certainly make ads more relevant and provide more targeted content. I also personally believe that behemoths like Facebook will face more pressure from Marketers to deliver results.

In addition, one may argue that it could decrease the "perceived value of ads". In the chart above, the amount to split between ads used to be $100 but is now only $10 (the remaining $90 represents the amount that would have been purchased even if no ad had been displayed). Therefore, moving to an incremental approach could hamper online investments and force the Online Ad Industry to focus more on results.

Some Agencies have started working on this topic. I recently attended a pitch by Numberly and they are going right in that direction. The game is on!

Tuesday, July 7, 2015

Uplift vs. Response: The Targeting Dilemma





Targeting is always reinventing itself. Who should be targeted for ad display? Who should be the target customers for a direct mailing campaign or an email blast? This has been quite puzzling to Marketers in recent years. Software providers and Marketing Agencies have been quick to seize this opportunity and offer a solution: Uplift Modeling. Let's dive into what this new realm really means.

1- What is Uplift:

Let's look at 2 groups of customers to get familiar with the concept. 

A pure response model would dictate targeting the segment with the highest Response Rate. Now, is that the right approach?

One may argue that maximum effectiveness, or Uplift, is a better criterion, since 5 – 2 = 3 more customers would convert for every 100 targeted.

So, Uplift is really the art of maximum efficiency, achieved by subtracting out the "base response". A corollary is to take out customers who are inelastic: either they cannot be convinced to take the offer (the lost causes) or they will shop anyway (the die-hards).


2- What is required to model Uplift?



a. Raise Awareness:

The first requirement is to realize that Response is not a good KPI. In my experience, it is sometimes difficult to convince management that their Marketing Department has been using the wrong KPI for such a long time. Plus, switching from a measurable to a latent metric can be quite daunting! Can Uplift be measured anyway?

b. Be Proactive at measuring Uplift:

With management support, the team can start building a performance report. What is the actual share of incremental Demand in total Response? This is the realm of A/B testing, which is a well-studied topic. A typical "gotcha" is the size of the Control Group. Too small a control group means that the measurement may be misleading due to uncertainty (see Confidence Intervals), so one should be generous with volume in the learning stage to ensure proper measurement. There are tools out there to calculate the right size; a sketch follows below.
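
As an illustration, here is one way to size the groups with statsmodels. The 2% control response rate, the 3% treated rate and the usual 5% / 80% alpha-power choices are assumptions, not numbers from the post:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_rate = 0.02    # assumed response rate in the control group
treated_rate = 0.03     # assumed response rate you want to be able to detect

effect_size = proportion_effectsize(treated_rate, baseline_rate)
n_per_group = NormalIndPower().solve_power(effect_size=effect_size, alpha=0.05, power=0.8)
print(round(n_per_group))   # roughly 1,900 customers per group under these assumptions
```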

c. Model Uplift:

Now that the team has a good handle on the current performance of the selection process, it is time to improve it using a modeling approach.

The model needs data from a randomized experiment, that is, a Marketing Treatment sent out at random. As you may expect, this data can be expensive to generate, but it will unlock insights that would not be identified otherwise.

The objective of the model is to segregate customer groups based on their Uplift. For instance, the tree below shows 4 distinct groups, with Uplift ranging from 0% to 10%. A smart Marketer would first target the best group with 10% uplift, then work her way down to the second leaf at 8%, etc.
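
For readers who want to experiment, here is a minimal "two-model" sketch on simulated data: fit separate response models on treated and control customers, and score uplift as the difference in predicted probabilities. This is only one of several possible approaches (not the uplift-tree method illustrated above), and the columns, rates and data are entirely made up:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Simulated randomized campaign: only recent customers actually react to the offer.
rng = np.random.default_rng(0)
n = 10_000
df = pd.DataFrame({
    "recency_days": rng.integers(1, 365, n),
    "past_orders": rng.poisson(2, n),
    "treated": rng.integers(0, 2, n),          # random assignment, as required
})
base = 0.02 + 0.01 * (df["past_orders"] > 2)
true_lift = 0.03 * (df["recency_days"] < 60)
df["response"] = rng.random(n) < base + df["treated"] * true_lift

features = ["recency_days", "past_orders"]
m_treated = LogisticRegression().fit(df.loc[df.treated == 1, features], df.loc[df.treated == 1, "response"])
m_control = LogisticRegression().fit(df.loc[df.treated == 0, features], df.loc[df.treated == 0, "response"])

# Uplift score = predicted response if treated minus predicted response if left alone.
df["uplift_score"] = (m_treated.predict_proba(df[features])[:, 1]
                      - m_control.predict_proba(df[features])[:, 1])
print(df.sort_values("uplift_score", ascending=False).head())   # target the top of this list first
```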



3- What options are available out there?


Now that you are all excited about Uplift Modeling, it's time to take the first step. I know of only 2 options on the Market using a mainstream modeling environment. Truth is, the offering is quite limited.

  • SAS Incremental Response Modeling (SAS Enterprise): http://support.sas.com/resources/papers/proceedings13/096-2013.pdf
  • R uplift package (Leo Guelman): http://cran.r-project.org/web/packages/uplift/index.html


Then, there are plenty of vendors/agencies that do Uplift modeling, such as Portrait Software (a product) and the regular Marketing Agencies / Consultancies.

4- Further Reading

To explore more on the topic, here are a few articles:
  • An "OK" popularization article: http://www.dummies.com/how-to/content/basics-of-uplift-predictive-analytics-models.html
  • A good research article http://stochasticsolutions.com/pdf/sig-based-up-trees.pdf
  • Another good article (by the author of the Uplift package in R): https://ideas.repec.org/p/bak/wpaper/201406.html 






Monday, May 25, 2015

Saving $12K on modeling jobs in AWS


Is your AWS bill going through the roof without you knowing what steps to take? Although lots of opinion leaders have voiced the somewhat flawed statement that "AWS is cheap", the reality is different: one needs to carefully monitor costs now that getting extra resources has become as simple as pushing a button.


AWS is still expensive:

AWS has lowered its prices repeatedly (42 times as of last year, according to AWS) in the past years in a race to the bottom with Google Cloud. That said, it can still run up to $17K a year for a large on-demand instance, as the graph below shows:






Modeling jobs are "peak demand":


Modeling jobs need lots of resources for a short amount of time, days at most. Consequently, they should be treated as peak demand, which makes them among the first candidates for Elastic Computing. Plus, they can easily be bundled as a "Data + Instructions" package, making it easy to offload them to another server.

StarCluster to the rescue:


StarCluster is an open-source utility developed at MIT. It was first built for resource-hungry MIT students running Simulation Jobs but has since migrated to the Cloud, AWS in particular.

In practice, StarCluster allows the user to spin up 100%-ready clusters on demand. Installation of the required packages (like password-less SSH and Network File System on the infrastructure side, and OpenMPI and OpenBLAS on the distributed computing side) is handled seamlessly by StarCluster, so that the user can focus on high-value tasks. Only 15-20 min are needed to get a cluster of any size ready to crunch data!

Besides, the organization can put power back into the users' hands by letting them set up their cluster on demand through a simple configuration file provided by StarCluster:



Conclusion:

All things considered, the winning formula would be to down-scale the current infrastructure and transfer the big modeling jobs to StarCluster. For instance, instead of having a 4xlarge, it would make sense to keep only 1 xlarge reserved instance for day-to-day operations and offload the rest to a StarCluster setup. In the process, you would save up to $12K in expenditures.

Also noteworthy: StarCluster comes with Python pre-installed (if you're more of a Python fan).

Wednesday, March 18, 2015

What I like about Tableau 9.0 ...






Tableau is about to get a major overhaul and it looks very good. I was invited to the Beta and the final product pleasantly surprised me by its quality. It looks neat and sleek. But, what's new in the box? I will go through the improvements I liked.

1- The improved Look and Feel:

Right from the start, you can feel that the UX team has done some good work on the User Experience. As soon as you open it, you get that warm feeling. As a bonus, you can now save your data source (which saves a good deal of time when starting a new project) and it will appear at the bottom left of the Welcome screen.



Folders:

A major drawback of previous releases was the clutter that wide datasets would generate with 200+ data fields, in addition to the calculated fields. As of 9.0, Tableau introduces folders, a neat way to organize your data:


Very useful! 

Improved Calculated fields:
Now integrating auto-suggest (finally!), calculated fields are much easier to create and manage.


More responsive Tooltip:

The days when you had to wait 3 seconds for the tooltip to pop up are gone. Now it displays swiftly, provides a better interaction and rivals D3.js in terms of interactivity.





2- Data Wrangling made easier:

We have all had that moment when we forgot to split a field and ended up creating 2 calculated fields because we did not want to regenerate the table. Tableau now integrates this simple yet useful functionality into the data source view. Under the hood, the instructions are passed through the driver and translated into SQL, which means that the loss in performance is minimal.




3- Analytics Pane - A quick access to simple Analytics:

Nothing really new here per se, but I found that the Analytics were quite hard to reach in previous releases. Now they are centralized in 1 pane and can be dragged and dropped.



4- Other improvements:

  • Extended Support for statistical files: You can now read R files (.RData) and SPSS files (.sav)
  • Connector to SparkSQL: It was released in 8.3 (although not visible to everyone) and is now accessible to everyone. It makes sense, as Databricks just released Spark 1.3.0 with major upgrades to SparkSQL.
  • Map Geography Lookup: You can now look up cities, states or countries on a map and Tableau will zoom in for you.
  • Parallel SQL Queries: I think this feature has its good and bad sides. For MPP/Hadoop databases, all requests are already processed in parallel, so it will not be helpful. For mature SQL databases (e.g. Oracle 11g, SQL Server or PostgreSQL), it makes sense when most of the time is spent on the database side (which is rarely the case in my experience)
  • Performance enhancements: Although I have not tested it, it seems that the team spent time on CPU-level optimization (Data Engine Vectorization) and Parallel Aggregation

5- What is still missing:

  • Dynamic Parameters: I was honestly expecting this to be part of the release. Parameters are static lists by construction and you can't use them as action filters. The Tableau Community has been pushing for it, but it seems it was delayed. Let's hope it will be part of the next shipment! 
  • Easier Excel-like Conditional Formatting: Another popular community request. There are work-arounds but it would enhance the tables greatly and enable score-cards, a popular reporting tool. 

Conclusion:

Tableau has put some good work into this release, especially on making the front-end easier to use and more intuitive. I think it's part of Tableau's strategy to go after more business users, where the market is. This has proven to be the right move for them so far.