Posts

Should you ship this feature?

Introduction

During my time at tech companies, one of the hardest tasks for my team has been guiding the Product team in their decision to launch a feature. It is both highly consequential and fraught with pitfalls. Let's take an example to ease into the topic: you worked hard to convince leadership to measure the effect of this cool new feature, and you got a nice experiment set up according to the canons of measurement. After 4 weeks, you are seeing a stat-sig improvement in your primary metric and no secondary metrics negatively affected. The story is clear as day: you should tell the team to ship it, right?

The problem with shipping features

Once the results are out, the focus is on the outcomes and leadership is incentivized to ship. As a result, little attention is given to the cost side of the equation: Tech debt: Does this feature make the system more complex? Does it hinder long-term growth by exacting a tax from any new feature? Think about the Net Present Value of such t
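A minimal sketch of what "stat-sig improvement in your primary metric" boils down to, assuming the primary metric is a conversion rate and using a two-proportion z-test on made-up numbers (neither the metric nor the figures come from an actual experiment):

```python
# Illustrative significance check on a hypothetical primary metric
# (conversion rate); all numbers are made up.
from statsmodels.stats.proportion import proportions_ztest

conversions = [5300, 5050]    # treatment, control
exposed = [100_000, 100_000]  # users in each arm

z_stat, p_value = proportions_ztest(count=conversions, nobs=exposed)
lift = conversions[0] / exposed[0] - conversions[1] / exposed[1]
print(f"absolute lift: {lift:.4f}, p-value: {p_value:.4f}")

# A small p-value is what "stat-sig" refers to; the point of the post is
# that significance alone does not settle the ship decision once costs
# such as tech debt enter the equation.
```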

Playing around with Vectorization

While I'm on paternity leave, I'm enjoying a bit of time off to write a blog article. Here we go! Vectorization of code is one of the key tenets of performance computing. Unfortunately, these aspects are quite hidden in high-level programming languages (Python/R/SQL), and Data Scientists are often unaware of the inner workings. As a matter of fact, even Software Engineers at Tech companies rarely go that low. This post aims at rediscovering them. Let's dive in!

Problem Statement: Let's take a simple problem: a social media company wants to send invites to folks that have a 2nd-degree connection (i.e. at least 1 friend in common) but are not friends yet. You are given an N x N matrix containing bits (1 = friend). How many emails would you send?

Context: There are better ways to solve this problem (for instance, by storing the indices of the friends as opposed to all the bits). Here, we are taking the data format as a constraint.

Implementations:

0- Baseline approach

We start w
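As a point of reference, here is a minimal sketch of a baseline triple loop versus a vectorized NumPy version for this counting problem (counting unordered non-friend pairs with at least one common friend is my assumption; this is not the post's actual code):

```python
# Baseline vs. vectorized count of non-friend pairs sharing >= 1 friend,
# given an N x N 0/1 friendship matrix. A sketch, not the post's code.
import numpy as np

def count_invites_baseline(A):
    """Pure-Python triple loop: O(N^3), no vectorization."""
    n = len(A)
    count = 0
    for i in range(n):
        for j in range(i + 1, n):
            if A[i][j]:
                continue  # already friends
            if any(A[i][k] and A[k][j] for k in range(n)):
                count += 1  # at least one friend in common
    return count

def count_invites_vectorized(A):
    """NumPy version: one matrix product replaces the inner loops."""
    A = np.asarray(A, dtype=np.int64)
    common = A @ A                      # common[i, j] = # of shared friends
    candidates = (common > 0) & (A == 0)
    np.fill_diagonal(candidates, False)
    return int(candidates.sum()) // 2   # unordered pairs

rng = np.random.default_rng(0)
A = np.triu(rng.integers(0, 2, size=(200, 200)), 1)
A = A + A.T  # symmetric friendship matrix, zero diagonal
assert count_invites_baseline(A.tolist()) == count_invites_vectorized(A)
```

The vectorized version delegates the inner loops to optimized matrix multiplication, which is exactly the kind of hidden machinery the post is about.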

My new job at Lyft

In October, I joined Lyft as Data Science Manager for Core Mapping. I wish I could have posted this update a while ago, but a big event got in the way (yes, we went public)... While low-visibility, Mapping turns out to be a big deal for ride-sharing, as it influences a lot of other services. In a nutshell, Mapping has an impact on pricing, driver dispatch, scheduled rides and customer XP. Also, it is usually the biggest friction point for seamless pickups.

Why is Mapping important for a ride-sharing company?

Mapping traditionally has 4 components: Basemap (a representation of the world as a graph), Locations (where are drivers and passengers?), Routing (optimal paths between locations) and ETA (distance and time between locations). When you open your app, you see the following screen. A lot of the information displayed is connected to the work of my team:

The surrounding physical world: road segments, (train) stations, POIs

The pick-up ETA (3 min) looks at drivers around t
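A toy sketch of two of these components, the basemap as a weighted graph and a routing/ETA query, using Dijkstra's algorithm on a hypothetical handful of road segments (purely illustrative, not Lyft's actual systems):

```python
# Toy basemap as a weighted graph (edge weight = travel time in seconds)
# and an ETA query via Dijkstra. Purely illustrative, not Lyft's stack.
import heapq

basemap = {  # hypothetical road segments
    "pickup":  {"1st_ave": 60, "station": 180},
    "1st_ave": {"station": 90, "dropoff": 300},
    "station": {"dropoff": 120},
    "dropoff": {},
}

def eta_seconds(graph, source, target):
    """Return the minimal travel time from source to target."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            return d
        if d > dist.get(node, float("inf")):
            continue
        for neighbor, weight in graph[node].items():
            nd = d + weight
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                heapq.heappush(heap, (nd, neighbor))
    return float("inf")

print(eta_seconds(basemap, "pickup", "dropoff"))  # 270 s, i.e. a ~4.5 min ETA
```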

5 rules for a productive Science team

Data Science is a new discipline. As such, companies (incl. Tech) are still trying to figure out the optimal configuration for the team. However, there are a few guiding principles that are important to follow so that your Science team does not fall apart. I have compiled my own in this blog article and I'm sharing them with you to generate a discussion:

1- Have a single source of truth

During my consulting years, I cannot count how many of my clients had challenges that emerged largely from a single problem: teams looking at different data. Aligning on metrics and methodology is very important, as it forces the team(s) into a "single view of the world". Without agreement, team progress is hampered by definition conflicts and confusion (when metrics disagree). In practice, this can be achieved through a single "fact table", adequate documentation (with thorough definitions and pointers to code) and availability of metrics where they matter

Focus: Shapley value

Game theory is a fascinating topic, codifying and quantifying all sorts of interactions between stakeholders in a game. The most popular setup is the prisoner's dilemma, but there is much more to it. Today, we will cover the Shapley value, as I recently stumbled across this original yet relatively unknown concept.

Problem at stake:

Example 1: In a salary negotiation, the employee showcases his skills and what they bring to the company. But how much are these skills worth?

Example 2: In a Joint Venture, each founding company brings expertise. What's a fair distribution of the ownership/shares?

Example 3: 2 (or 3) TelCo companies want to build a fiber network that would benefit them all. What's a fair payment breakdown?

When you think about these problems, most of us would tend to answer them through guesses ("You should ask for an X% raise because you deserve it"), but there is actually a theory for it. Let's look at the theory behind it.

Solution:

This is kn
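A minimal brute-force sketch of the Shapley value (each player's average marginal contribution over all orderings of the players), using a made-up 3-player characteristic function in the spirit of the TelCo example:

```python
# Brute-force Shapley values: average marginal contribution of each
# player over all orderings. The coalition values below are made up.
from itertools import permutations
from math import factorial

players = ("A", "B", "C")

def v(coalition):
    """Hypothetical value created by each coalition (e.g. cost savings)."""
    values = {
        frozenset(): 0,
        frozenset("A"): 10, frozenset("B"): 20, frozenset("C"): 30,
        frozenset("AB"): 50, frozenset("AC"): 60, frozenset("BC"): 70,
        frozenset("ABC"): 120,
    }
    return values[frozenset(coalition)]

def shapley(players, v):
    n_orderings = factorial(len(players))
    phi = {p: 0.0 for p in players}
    for order in permutations(players):
        coalition = set()
        for p in order:
            before = v(coalition)
            coalition.add(p)
            phi[p] += v(coalition) - before  # marginal contribution of p
    return {p: total / n_orderings for p, total in phi.items()}

print(shapley(players, v))  # the three shares sum to v(ABC) = 120
```

By construction the shares sum to the value of the grand coalition, which is what makes this a candidate answer to the "fair payment breakdown" question.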

Scala: The Bridge Language

The Fragmented World of Languages

Lots of changes have happened in every dimension of Data Science: applications, algorithms and techniques, software and, of course, languages. Languages tell a fascinating story because they are a reflection of the nature and the state of mind of the practitioners. Not so surprisingly, they have changed a lot over the years. When I first started getting interested in Data Science, SAS and Matlab were still sure bets. A few years in (2013 or so), R became the Lingua Franca around me: easy to code and understand, with vectorized calculations backed by the DataFrame API, it was very practical for non-CS practitioners (read: statisticians and engineering generalists like myself) to use. It did away with a lot of the lower-level considerations and ended up making a simpler interface, predictably at the expense of the CS crowd, who loathed such abstractions. Today, I think we are at another junction where the ball is moving in the opposite direction: the CS

The future of Purchase Attribution

The Online Ad industry is alive and well. A specialist firm reports double-digit growth, with Total Display expected to amount to over $90B worldwide by 2017, a roaring 18% CAGR since 2014. With this influx of investment, Marketers will expect to see improvements in measurement methodologies, and we are already seeing signs of it in this area.

The perennial problem of Purchase Attribution

Let's take an example to illustrate this famous problem: Betty is an avid internet user. As she is showing signs of a future Apparel purchase, H&M buys her address and sends her a promotional email on Day 0. Two days later, she is targeted a second time on her mobile while doing a search on Google. Finally, on Day 4, she connects to Facebook and clicks one of those Carousel ads, leading to a purchase on H&M's site. Who should get the credit for the purchase? Facebook?

The Past: Last click

The simplest (and now outdated) attribution methodology was to give credit t
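A minimal sketch of last-click attribution applied to Betty's journey, with a uniform split shown for contrast (the touchpoints and order value are made up to match the example):

```python
# Last-click attribution: all credit goes to the final touchpoint before
# the purchase. A uniform split is shown for contrast. Illustrative only.
from collections import defaultdict

journey = [  # (day, channel), ending in a purchase after the last touch
    (0, "email"),
    (2, "google_search_ad"),
    (4, "facebook_carousel"),
]
purchase_value = 80.0  # hypothetical order value

def last_click(journey, value):
    credit = defaultdict(float)
    credit[journey[-1][1]] += value  # everything to the last touch
    return dict(credit)

def uniform(journey, value):
    credit = defaultdict(float)
    for _, channel in journey:
        credit[channel] += value / len(journey)
    return dict(credit)

print(last_click(journey, purchase_value))  # {'facebook_carousel': 80.0}
print(uniform(journey, purchase_value))     # ~26.67 per channel
```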