Sunday, August 4, 2019

My new job at Lyft

In October, I joined Lyft as Data Science Manager for Core Mapping. I wish I could have posted this update a while ago, but a big event got in the way (yes, we went public)...

While low-visibility, Mapping turns out to be a big deal for ride-sharing, as it influences a lot of other services. In a nutshell, Mapping has an impact on pricing, driver dispatch, scheduled rides and customer experience. It is also usually the biggest friction point for seamless pickups.

Why is Mapping important for a ride-sharing company?

Mapping traditionally has 4 components: Basemap (a representation of the world as a graph), Locations (where are drivers and passengers?), Routing (optimal paths between locations) and ETA (distance and time between locations).

When you open your app, you see the following screen:

A lot of information displayed is connected to the work by my team:
  • The surrounding physical world: road segments, (train) stations, POI
  • The pick-up ETA (3 min) looks at drivers around the pin and compares them to demand. This gives an estimate of how fast we can get a car to you
  • The drop-off ETA (10:03) estimates when we can get you to your destination 
  • The price ($36.66) which uses a lot of indicators including dropoff ETAs to make the ride fair for both parties: riders and drivers
  • The polyline (purple) showing the route to your destination

Once you click on "Request", the application will dispatch a driver to you. This decision is central to the efficiency of the platform and therefore very sophisticated. Getting a driver to you ASAP (using our ETAs) is critical to the best user experience. There are usually two ways of solving it: the greedy way (dispatching the closest driver) and the optimal way (solving it as a global minimization problem).

Exciting problems in Core Mapping 

What are some of the problems in Mapping that are very interesting in nature? This is a broad question, so I will just sample a few:

How does driver location influence ETA?
We collect GPS signals from our drivers by streaming them to our services. There is a chance that they can be wrong: for instance, snapping the driver to the wrong road segment often occurs in urban canyons.

In the example below, what if the driver is falsely matched to Bryant Street but is actually about to enter I-80? Dispatching this driver would arguably be disastrous: the driver getting on I-80 may have to drive all the way to Oakland and back! A small map-matching variation (in distance) has a non-linear relationship with ETA.

How does ETA influence Dispatch?

To provide the best user experience, dispatch uses pairwise (DVR, PAX) ETAs to make the best Supply <==> Demand match. In the ideal state, dispatch minimizes the SUM(duration) across all dispatches. However, we only measure the dispatches that occurred and have little information about those that did not materialize. Gaining a better understanding of what leads to a "Bad dispatch" (whether false positive or false negative) is a central topic for my team and for Lyft!
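The greedy vs. global trade-off can be sketched with a toy example. The ETAs below are made up, and real dispatch weighs far more than pickup time, but the mechanics are the same:

```python
from itertools import permutations

# Pairwise pickup ETAs in minutes: etas[driver][passenger] (made-up numbers)
etas = [
    [2, 7],  # driver 0
    [3, 9],  # driver 1
]

def greedy_dispatch(etas):
    """Assign each passenger, in order, the closest still-available driver."""
    free_drivers = set(range(len(etas)))
    total = 0
    for pax in range(len(etas[0])):
        best = min(free_drivers, key=lambda d: etas[d][pax])
        total += etas[best][pax]
        free_drivers.remove(best)
    return total

def optimal_dispatch(etas):
    """Global minimization: smallest total pickup time over all matchings."""
    n = len(etas)
    return min(sum(etas[d][p] for p, d in enumerate(assignment))
               for assignment in permutations(range(n)))

print(greedy_dispatch(etas))   # 11 -- passenger 0 grabs driver 0, leaving a 9-min pickup
print(optimal_dispatch(etas))  # 10 -- swapping the two drivers is better overall
```

At production scale, the brute-force search would of course be replaced by an assignment solver, but even this toy case shows greedy leaving minutes on the table.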

As Lyft speeds into 2020, Mapping will become an even more important topic, e.g. for multi-modal transportation. Reach out to me if you want to join the team!

Sunday, June 16, 2019

5 rules for a productive Science team

Data Science is a new discipline. As such, companies (incl. Tech) are still trying to figure out the optimal configuration for the team. However, there are a few guiding principles that are important to follow so that your Science team does not fall apart. I have compiled my own in this blog article and I'm sharing them with you to generate a discussion:

1- Have a single source of truth

During my consulting years, I cannot count how many of my clients had challenges that emerged largely from a single problem: teams looking at different data.

Aligning on metrics and methodology is very important as it forces the team(s) into a "single view of the world". Without agreement, team progress is hampered by definition conflicts and confusion (when metrics disagree).

In practice, this can be achieved through a single "fact table", adequate documentation (with thorough definitions and pointers to code) and availability of metrics where they matter (for instance, on your experimentation platform) so that nobody is tempted to create their own metrics!

2- Prioritize Iteration speed 

Iteration speed eats theoretical work for breakfast. 

At tech companies, systems are big and complicated. As a result, it may be tempting to focus on theoretical work and sacrifice iterations.

Instead, I would advocate that you find a way to simplify the problem to a point where an iteration can be done rapidly (aim for an order of magnitude of 30-60 min). On the data side, downsampling is a common technique. Looking at a single use case is another. On the code side, carving out the codebase so that it can be run in an interactive notebook (e.g. Jupyter) works too. Scaling to larger data / scope will come when your solution is ready!

3- Observability is everything

Lord Kelvin lived 200 years ago but one of his principles still applies well today: "If you can't measure it, you can't improve it". 

By defining your metrics properly, you are setting the direction of the team's effort. Further, by measuring everything, you are measuring the progress toward your target and enabling the team to diagnose problems along the way. 

4- Complexity is bad

Tech and Consulting have the resources to hire talented employees. This is generally good: Talented employees love to work with equally talented employees and build great products ! 

However, there is one downside: talented employees love to build complex stuff! Complexity is often overlooked when the gain is palpable (or when the improvement is "fancy" - for instance, using the latest Neural Net config). Proper diligence on how much burden this adds to iteration cycle time, debugging, and reproducibility is key to keeping the systems lean and the team moving.

5- Naming convention is surprisingly important

Surely, naming variables is no biggie, is it? You couldn't be further from the truth! Bad variable names make the code harder to read and, worse, introduce bugs!
  • Always specify units of measure: weight_kg is much better than weight
  • Be specific ! Don't name your variables: var or tmp (really, don't)
  • Indicating types with a suffix doesn't hurt for booleans: for instance, is_valid_ind or is_valid_flag
  • Stick to your team conventions. For instance, 1/0 can be either coded with a boolean or with a tinyint
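Applied to a toy (entirely made-up) Python function, these conventions look like this:

```python
# Hard to read: what unit is w? What does the return value mean?
def check(w, t):
    return w > t

# Self-documenting: units in the names, boolean signaled by the is_ prefix
def is_overweight(weight_kg: float, limit_kg: float) -> bool:
    return weight_kg > limit_kg

print(is_overweight(weight_kg=23.5, limit_kg=20.0))  # True
```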

What are your rules? Let's have a discussion!

Monday, June 4, 2018

Focus: Shapley value

Game theory is a fascinating topic codifying and quantifying all sorts of interactions between stakeholders in a game. The most popular setup is the prisoner's dilemma but there is much more to it. Today, we will cover the Shapley value as I recently stumbled across this original yet relatively unknown concept.

Problem at stake:

Example 1: In a salary negotiation, the employee showcases his skills and what they bring to the company. But how much are these skills worth?
Example 2: In a Joint Venture, each founding company brings expertise. What's a fair distribution of the ownership/shares?
Example 3: 2 (or 3) TelCo companies want to build a fiber network that would benefit all of them. What's a fair payment breakdown?

When you think about these problems, most of us would tend to answer them through guesses: "You should ask for an X% raise because you deserve it." But there is actually a theory for it.

Let's look at the theory behind it.


This is known as a cooperative game, and there is only one breakdown function that fulfills a few conditions (more on this later).

The gist of it: "Player A's fair reward is the average of his marginal contributions to the different coalitions leading to the final setup", where:

For a game with 3 players (A, B, C), we denote:
  • Final setup: the final set of stakeholders S = {A, B, C}
  • Coalition: a subset of S
  • Marginal contribution: adding A to {B, C} is Value{A,B,C} − Value{B,C}

Now, let's introduce values to the example. For instance:
  • V(A) = 12
  • V(B) = 8
  • V(C) = 2
  • V(A,B) = 22
  • V(A,C) = 15
  • V(B,C) = 11
  • V(A,B,C) = 23

The Shapley value of A is calculated by averaging A's marginal contribution over the 3! = 6 orders in which the coalition can be built:

Shapley(A) = [ V(A) + V(A) + (V(A,B) − V(B)) + (V(A,B,C) − V(B,C)) + (V(A,C) − V(C)) + (V(A,B,C) − V(B,C)) ] / 6
           = (12 + 12 + 14 + 12 + 13 + 12) / 6 = 12.5

From there, you can intuit the general formula, where n! is the total number of permutations: the sum, over every coalition K containing A, of

[ (|K| − 1)! × (n − |K|)! / n! ] × (V(K) − V(K \ A))

where K \ A denotes the coalition K without A.
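The numbers can be checked by brute force in a few lines of Python, averaging over all join orders directly:

```python
from itertools import permutations

# Coalition values from the example above (empty coalition is worth 0)
v = {frozenset(): 0,
     frozenset("A"): 12, frozenset("B"): 8, frozenset("C"): 2,
     frozenset("AB"): 22, frozenset("AC"): 15, frozenset("BC"): 11,
     frozenset("ABC"): 23}

def shapley(player, players="ABC"):
    """Average the player's marginal contribution over all join orders."""
    orders = list(permutations(players))
    total = 0.0
    for order in orders:
        joined_before = frozenset(order[:order.index(player)])
        total += v[joined_before | {player}] - v[joined_before]
    return total / len(orders)

print({p: shapley(p) for p in "ABC"})  # {'A': 12.5, 'B': 8.5, 'C': 2.0}
```

Note that the three values sum to V(A,B,C) = 23: this "efficiency" property is one of the conditions that make the Shapley value unique.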

Wednesday, April 19, 2017

Scala: The Bridge Language

The Fragmented World of Languages

Lots of changes have happened in every dimension of Data Science: applications, algorithms and techniques, software and, of course, languages. Languages tell a fascinating story because they are a reflection of the nature and state of mind of the practitioners. Not so surprisingly, they have changed a lot over the years.

When I first started getting interested in Data Science, SAS and Matlab were still sure bets. A few years in (2013 or so), R became the Lingua Franca around me: easy to code and understand, the vectorized calculations backed by the DataFrame API made it very practical for non-CS practitioners (read: statisticians and engineering generalists like myself) to use. It did away with a lot of the lower-level considerations and ended up making a simpler interface, predictably at the expense of the CS crowd, who loathed such abstractions.

Today, I think we are at another junction where the ball is moving in the opposite direction: the CS languages that were catering to the Computer Scientist community are now becoming increasingly easy to use, their most prominent member being Martin Odersky's Scala. Let's dive in.


Scala: the Bridge Language

My personal experience with Scala goes back to 2015, when I wanted to better understand Apache Spark. The language is touted as a user-friendly alternative to Java (both run on the JVM) that also has lots of additional features like macros and Functional Programming constructs. Right from the get-go, the language strikes you with its ease of use (the REPL really helps) and its clean syntax.

Scala is unique because it can fulfill the needs of both Programmers and Data Scientists: it sits at the junction of the 2 worlds and is therefore what I call a "Bridge Language". Smart companies, e.g. LinkedIn, Square and Spotify to name a few, have been quick to understand its benefits: more productivity, more collaboration and more readability. All of a sudden, the code did not need rewriting for production use!

What Scala can do for you:

Scala can use any Java library, which makes it easy to leverage all the work that has taken place in Java over the last 2 decades. If you don't know Java, this is a good entry point because Scala is more user-friendly. If you know Java, Scala is worth learning to become more productive.

Besides, additional Scala libraries bring top notch code to you:

Data Science: 
  • Apache Spark: The ultimate tool for Big Data Processing with its MLLib for Data Science
  • Apache Kafka: The go-to tool for stream distribution and processing
  • Akka: Actors (quite low-level but easy to use)
Web Dev:
  • Play Framework: Modern Web Applications for everybody 
  • Slick: Functional connector to databases
Those libraries are very well documented. You can usually get started in minutes even if you're not an expert in the field.

How to get started:

I think getting your feet wet is the biggest hurdle - mostly psychological but there are also some technical difficulties that can be hard to overcome if you don't follow the right sequence. Here is my recommended guide to nail it:

  1. I would start with Martin Odersky's Coursera class. It is a great intro to Functional Programming (which you will encounter often in Scala) and it also features great material about the installation. Try to do the whole specialization if you have time. I would encourage you to use IntelliJ as your IDE, a better alternative to Eclipse.
  2. If you're a Data Scientist, try to learn more about Data Structures using Scala. I followed the Data Structure and Algo specialization on Coursera, a $420 investment you won't regret. If you are a Software Engineer, chances are you already have this knowledge (but it could be a good refresher)
  3. Choose a side project to hone your skills. Here are a few ideas:
    1. Create a web service with Play
    2. Put together a Data Pipeline with Spark
    3. Replace a small Java App with Scala (or code your next Java App with Scala)
Good luck!

Have experience using Scala? Let me know what you think!

Thursday, October 8, 2015

The future of Purchase Attribution

The Online Ad industry is alive and well. A specialist firm reports double-digit growth, with Total Display expected to amount to over $90B worldwide by 2017, a roaring 18% CAGR since 2014.

With this influx of investment, Marketers will expect to see improvements in measurement methodologies, and we are already seeing signs of it in this area.

The perennial problem of Purchase Attribution

Let's take an example to illustrate this famous problem: Betty is an avid internet user. As she is showing signs of a future Apparel purchase, H&M buys her address and sends her a promotional email on Day 0. Two days later, she is targeted a second time on her mobile while doing a search on Google. Finally, on Day 4, she connects to Facebook and clicks one of those Carousel ads, leading to a purchase on H&M's site.

Who should get the credit for the purchase? Facebook?

The Past: Last click

The simplest (and now outdated) attribution methodology was to give credit to the last ad, hence its name "Last Click". While simple and easily understandable by everyone, it has the flaw of not giving any credit to the first 2 interactions, although they might have been necessary to initiate Betty's interest in H&M.

The present: Time Decay

A few years ago, Google Analytics introduced many alternative attribution methodologies, including Time Decay. This method gives more credit to more recent events. Consequently, Paid Email and Mobile Ads get respectively $20 and $30 of credit in this illustrated example. A good step forward!
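Time Decay is typically implemented as an exponential decay on the time between the touch and the conversion. Here is a minimal sketch in Python; the 7-day half-life (Google Analytics' default, as I understand it) is an assumption, so the resulting split is illustrative rather than the exact dollar figures above:

```python
HALF_LIFE_DAYS = 7.0  # assumption: GA's default time-decay half-life

# (channel, day of touch) from Betty's journey; conversion happens on day 4
touches = [("Paid Email", 0), ("Mobile Ad", 2), ("Facebook Ad", 4)]
conversion_day = 4

# Each touch is weighted by 2^(-age / half_life), then weights are normalized
weights = {name: 2 ** (-(conversion_day - day) / HALF_LIFE_DAYS)
           for name, day in touches}
total_weight = sum(weights.values())
credits = {name: round(100 * w / total_weight, 1)
           for name, w in weights.items()}

print(credits)  # Facebook gets the largest share, Paid Email the smallest
```

Shrinking the half-life pushes more credit onto the last touch; in the limit, Time Decay degenerates back to Last Click.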

That said, the fundamental problem remains: What would have happened had the Marketer not bought the FB ad? How about not buying the mobile ad? Or the Paid email? All in all, what would be the best combination of marketing stimuli here?

If the Time Decay methodology holds true, it can be implied that the purchase would not have been made in 50% of the cases had the Marketer not bought the FB ad. Is that accurate? Can we find a better way to estimate the true effect of each interaction? This is where Uplift comes into play.

The Future: Uplift

What happened to Direct Mail, where more and more companies use Control Groups to measure the true incremental effect, will undoubtedly roll into the Online Ad space in the future. This is an exciting time!

We will start seeing systematic control groups on ads. For each interaction on the customer journey, the marketer will be able to assess the true effect of each ad, that is the delta between the group that was exposed to the ad and the group that was not (or exposed to a competitor's ad). This is very powerful as it will provide a view on effects as opposed to pure correlations. Plus, it is an accurate and scientific way as opposed to a highly subjective methodology making use of parameters, e.g. the time decay coefficient, that are set arbitrarily.  

What it means for you

Uplift Measurement is a game changer. It will force ad providers to change their mindset from "being on the customer journey" to "altering the customer journey". It will certainly make ads more relevant and provide more targeted content. I also personally believe that behemoths like Facebook will face more pressure from Marketers to get results.

In addition, one may argue that it could decrease the "perceived value of ads". In the chart above, the amount to split between ads used to be $100 but is now only $10 (the outstanding $90 represents the amount that would have been spent even if no ad was displayed). Therefore, moving to an incremental approach could hamper online investments and force the Online Ad Industry to focus more on results.

Some Agencies have started working on this topic. I recently attended a pitch by Numberly and they are going right in that direction. The game is on!

Tuesday, July 7, 2015

Uplift vs. Response: The Targeting Dilemma

Targeting is always reinventing itself. Who should be targeted for ads display? What should be the target customers for a direct mailing campaign or an email blast? This has been quite puzzling to Marketers in recent years. Software providers and Marketing Agencies have been quick to seize this opportunity to offer a solution to it: Uplift Modeling. Let's deep dive into what this new realm exactly means.

1- What is Uplift:

Let's look at 2 groups of customers to get familiar with the concept. 

A pure response model would dictate targeting the segment with the highest Response Rate. Now, is that the right approach?

One may argue that maximum effectiveness, or Uplift, is a better criterion, since 5 − 2 = 3 more customers would convert for every 100 customers targeted.

So, Uplift is really the art of maximum efficiency: subtracting out the "base response". A corollary is to take out customers that are inelastic, either because they cannot be convinced to take the offer (the lost causes) or because they will shop anyway (the die-hards).

2- What is required to model Uplift?

a. Raise Awareness:

The first requirement is to realize that Response is not a good KPI. In my experience, it is sometimes difficult to convince management that their Marketing Department has been using the wrong KPI for such a long time. Plus, switching from a measurable to a latent metric can be quite daunting! Can Uplift be measured anyway?

b. Be Proactive at measuring Uplift:

With management support, the team can start creating a performance report. What is the actual share of incremental Demand out of total Response? This is the realm of A/B testing, which is a well-studied topic. A typical "gotcha" is the size of the Control Group: too small a control group means that the measurement may be misleading due to uncertainty (see Confidence Intervals), so one should be generous with volume in the learning stage to ensure proper measurements. There are tools out there to calculate the right size.
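For a rough sense of scale, the standard two-proportion power calculation can be sketched in a few lines of Python. The 2% control vs. 5% treated response rates below are illustrative:

```python
import math
from statistics import NormalDist

def arm_size(p_control: float, p_treated: float,
             alpha: float = 0.05, power: float = 0.8) -> int:
    """Customers needed per arm to detect p_treated vs p_control
    with a two-sided z-test at the given significance and power."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # ~1.96 for alpha = 0.05
    z_beta = z.inv_cdf(power)            # ~0.84 for 80% power
    variance = p_control * (1 - p_control) + p_treated * (1 - p_treated)
    delta = p_treated - p_control
    return math.ceil((z_alpha + z_beta) ** 2 * variance / delta ** 2)

print(arm_size(0.02, 0.05))  # 586 customers per arm
```

Note how quickly the requirement grows as the expected uplift shrinks: halving the detectable difference roughly quadruples the required volume.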

c. Model Uplift:

Now that the team has a good handle on the current performance of the selection process, it is time to improve it using a modeling approach.

The model needs data from a randomized experiment, that is, sending a Marketing Treatment randomly. As you may expect, this data can be expensive to generate, but it will unlock insights that would not be identified otherwise.

The objective of the model is to segregate customer groups based on their Uplift. For instance, the tree below shows 4 distinct groups with Uplift ranging from 0% to 10%. A smart Marketer would first target the best group with 10% uplift, then work her way down to the second leaf at 8%, etc.

3- What options are available out there?

Now that you are all excited about Uplift Modeling, it's time to take the first step. I know of only 2 options in the Market using a mainstream modeling environment. Truth is that the offering is quite limited.

  • SAS Incremental Response Modeling (SAS Enterprise): http://support.sas.com/resources/papers/proceedings13/096-2013.pdf
  • R uplift package (Leo Guelman): http://cran.r-project.org/web/packages/uplift/index.html

Then, there are plenty of vendors/agencies that do Uplift modeling, like Portrait Software (product) and the regular Marketing Agencies / Consultancies.

4- Further Reading

To explore more on the topic, here are a few articles:
  • An "OK" popularization article: http://www.dummies.com/how-to/content/basics-of-uplift-predictive-analytics-models.html 
  • A good research article http://stochasticsolutions.com/pdf/sig-based-up-trees.pdf
  • Another good article (by the author of the Uplift package in R): https://ideas.repec.org/p/bak/wpaper/201406.html 

Monday, May 25, 2015

Saving $12K on modeling jobs in AWS

Your AWS bill is going through the roof and you don't know what steps to take? Although lots of opinion leaders have voiced the somewhat flawed statement that "AWS is cheap", the reality is different: one needs to carefully monitor costs now that getting extra resources has become as simple as pushing a button.

AWS is still expensive:

AWS has lowered its prices repeatedly (42 times as of last year, according to AWS) in a race to the bottom with Google Cloud. That said, it can still run up to $17K a year for a large on-demand instance, as the graph below shows:

Modeling jobs are "peak demand":

Modeling jobs need lots of resources for a short amount of time, days at the most. Consequently, they should be treated as peak demand, which makes them among the first candidates for Elastic Computing. Plus, a modeling job can easily be bundled as a "Data + Instructions" package, making it easy to offload to another server.

Starcluster to the rescue:

StarCluster is an open-source utility developed at MIT. It was first built for resource-hungry MIT students running Simulation Jobs but has since migrated to the Cloud, AWS in particular.

In practice, StarCluster allows the user to spin up 100%-ready clusters on demand. Installation of required packages (like password-less SSH and the Network File System on the infrastructure side, and OpenMPI and OpenBLAS on the distributed computing side) is handled seamlessly by StarCluster so that the user can focus on the high-value tasks. Only 15-20 min are needed to get a cluster of any size ready to crunch data!

Besides, the organization can put the power back into the users' hands by letting them set up their cluster on demand through a simple configuration file provided by StarCluster:
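As a rough sketch, a StarCluster configuration file looks like the following; the credentials, AMI and instance type below are placeholders to adapt to your own setup:

```ini
[global]
DEFAULT_TEMPLATE = modelingcluster

[aws info]
AWS_ACCESS_KEY_ID = <your-access-key-id>
AWS_SECRET_ACCESS_KEY = <your-secret-key>
AWS_USER_ID = <your-account-id>

[key mykey]
KEY_LOCATION = ~/.ssh/mykey.rsa

[cluster modelingcluster]
KEYNAME = mykey
CLUSTER_SIZE = 4
NODE_INSTANCE_TYPE = c3.xlarge
NODE_IMAGE_ID = ami-xxxxxxxx
```

With the default template set, spinning the cluster up is then a one-liner: starcluster start mycluster.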


All things considered, the winning formula would be to down-scale the current infrastructure and transfer the big modeling jobs to StarCluster. For instance, instead of having a 4xlarge, it would make sense to keep only one x-large reserved instance for the day-to-day operations and offload the rest to a StarCluster setup. In the process, you would save up to $12K in expenditures.

Also noteworthy: StarCluster comes with Python pre-installed (if you're more of a Python fan).