Guillaume's Blog - Data Science

Posts

Playing around with Vectorization

August 24, 2021

While I'm on paternity leave, I'm enjoying a bit of time off to write a blog article. Here we go! Vectorization of code is one of the key tenets of performance computing. Unfortunately, these aspects are quite hidden in high-level programming languages (Python/R/SQL) and Data Scientists are often unaware of the inner workings. As a matter of fact, even Software Engineers at Tech companies rarely go that low. This post aims at rediscovering them. Let's dive in! Problem Statement: Let's take a simple problem: A social media company wants to send invites to folks that have a 2nd degree connection (i.e. at least 1 friend in common) but are not friends yet. The problem gives you an N x N matrix containing bits (1=friend). How many emails would you send? Context: There are better ways to solve this problem (for instance, by storing the index of the friends as opposed to all bits). Here, we are taking the data format as a constraint. Implementations: 0- Baseline approach We start w

My new job at Lyft

August 04, 2019

In October, I joined Lyft as Data Science Manager for Core Mapping. I wish I could have posted this update a while ago but a big event got in the way (yes, we went public) ... Despite its low visibility, Mapping turns out to be a big deal for ride-sharing as it influences a lot of other services. In a nutshell, Mapping has an impact on pricing, driver dispatch, scheduled rides and customer XP. Also, it is usually the biggest friction point for seamless pickups. Why is Mapping important for a ride-sharing company? Mapping traditionally has 4 components: Basemap (representation of the world as a graph), Locations (where are drivers and passengers?), Routing (optimal paths between locations) and ETA (distance and time between locations). When you open your app, you see the following screen: A lot of the information displayed is connected to my team's work: The surrounding physical world: road segments, (train) stations, POI The pick-up ETA (3min) looks at drivers around t

5 rules for a productive Science team

June 16, 2019

Data Science is a new discipline. As such, companies (incl. Tech) are still trying to figure out the optimal configuration for the team. However, there are a few guiding principles that are important to follow so that your Science team does not fall apart. I have compiled my own in this blog article and I'm sharing them with you to generate a discussion: 1- Have a single source of truth During my consulting years, I cannot recall how many of my clients had challenges that emerged largely from a single problem: teams looking at different data. Aligning on metrics and methodology is very important as it forces the team(s) into a "single view of the world". Without agreement, team progress is hampered by definition conflicts and confusion (when metrics disagree). In practice, this can be achieved through a single "fact table", adequate documentation (with thorough definitions and pointers to code) and availability of metrics where they matter

Focus: Shapley value

June 04, 2018

Game theory is a fascinating topic codifying and quantifying all sorts of interactions between stakeholders in a game. The most popular setup is the prisoner's dilemma but there is much more to it. Today, we will cover the Shapley value as I recently stumbled across this original yet relatively unknown concept. Problem at stake: Example 1: In a salary negotiation, the employee showcases his skills and what they bring to the company. But how much are these skills worth? Example 2: In a Joint Venture, each founding company brings expertise. What’s a fair distribution of the ownership/shares? Example 3: 2 (or 3) TelCo companies want to build a fiber network that would benefit all of them. What’s a fair payment breakdown? When you think about these problems, most of us would tend to answer them through guesses: “You should ask for X% raise because you deserve it” but there is actually a theory for it. Let’s look at the theory behind it. Solution: This is kn

Scala: The Bridge Language

April 19, 2017

The Fragmented World of Languages Data Science has changed a lot in every dimension: applications, algorithms and techniques, software and, of course, languages. Languages tell a fascinating story because they reflect the nature and the state of mind of the practitioners. Not surprisingly, they have changed a lot over the years. When I first started getting interested in Data Science, SAS and Matlab were still sure bets. A few years in (2013 or so), R became the Lingua Franca around me: easy to code and understand, the vectorized calculations backed by the DataFrame API made it very practical for non-CS practitioners (read statisticians, engineering generalists like myself) to use. It did away with a lot of the lower level considerations and ended up making a simpler interface, predictably at the expense of the CS crowd, who loathed such abstractions. Today, I think we are at another junction where the ball is moving in the opposite direction: the CS

The future of Purchase Attribution

October 08, 2015

The Online Ad industry is alive and well. A specialist firm reports double-digit growth: Total Display is expected to amount to over $90B worldwide by 2017, a roaring 18% CAGR since 2014. With this influx of investment, Marketers will expect to see improvements in measurement methodologies and we are already seeing signs of it in this area. The perennial problem of Purchase Attribution Let's take an example to illustrate this famous problem: Betty is an avid internet user. As she is showing signs of a future Apparel purchase, H&M buys her address and sends her a promotional email on Day 0. Two days later, she is targeted a second time on her mobile while doing a search on Google. Finally, on Day 4, she connects to Facebook and clicks one of those Carousel ads leading to a purchase on the H&M site. Who should get the credit for the purchase? Facebook? The Past: Last click The simplest (and now outdated) attribution methodology was to give credit t
