5 rules for a productive Science team

Data Science is a new discipline. As such, companies (including Tech) are still figuring out the optimal configuration for their teams. However, there are a few guiding principles worth following so that your Science team does not fall apart. I have compiled my own in this blog article and I'm sharing them with you to start a discussion:

1- Have a single source of truth

During my consulting years, I lost count of how many of my clients had challenges that emerged largely from a single problem: teams looking at different data.

Aligning on metrics and methodology is very important because it forces the team(s) into a "single view of the world". Without agreement, progress is hampered by definition conflicts and by confusion when metrics disagree.

In practice, this can be achieved through a single "fact table", adequate documentation (with thorough definitions and pointers to code), and availability of metrics where they matter (for instance, on your experimentation platform), so that nobody is tempted to create their own metrics!
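To make this concrete, here is a minimal sketch of what a canonical metrics registry could look like. All names (metric names, owners, table names) are hypothetical, and a real setup would likely live in a metrics layer or documentation tool rather than a Python dict:

```python
# Hypothetical "single source of truth" for metric definitions.
# One canonical entry per metric: a thorough definition, an owner,
# and a pointer to the single fact table it is computed from.

METRICS = {
    "conversion_rate": {
        "definition": "orders / sessions, per day",
        "owner": "growth-team",
        "source_table": "analytics.fact_sessions",
    },
    "avg_order_value_usd": {
        "definition": "sum(order_total_usd) / count(orders), per day",
        "owner": "finance-team",
        "source_table": "analytics.fact_orders",
    },
}

def describe(metric_name: str) -> str:
    """Return the canonical definition, or raise KeyError if unknown."""
    meta = METRICS[metric_name]
    return f"{metric_name}: {meta['definition']} (source: {meta['source_table']})"

print(describe("conversion_rate"))
```

The point is not the data structure itself but the discipline: any team asking "what is conversion rate?" gets exactly one answer, with a pointer to the code that computes it.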

2- Prioritize iteration speed

Iteration speed eats theoretical work for breakfast. 

At tech companies, systems are big and complicated. As a result, it may be tempting to retreat into theoretical work and sacrifice iterations.

Instead, I would advocate simplifying the problem to the point where an iteration can be done rapidly (aim for roughly 30-60 minutes). On the data side, downsampling is a common technique; looking at a single use case is another. On the code side, carving out the codebase so that it can be run in an interactive notebook (e.g. Jupyter) works too. Scaling to larger data and scope will come once your solution is ready!
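The downsampling idea can be sketched in a few lines. This is an illustrative example only (the row data and sampling fraction are made up); the key detail is seeding the random generator so the sample is reproducible across iterations:

```python
import random

def downsample(rows, fraction=0.01, seed=42):
    """Keep a reproducible random fraction of rows so each iteration runs in minutes."""
    rng = random.Random(seed)  # fixed seed: same sample on every run
    return [row for row in rows if rng.random() < fraction]

rows = list(range(1_000_000))   # stand-in for a large dataset
sample = downsample(rows)       # ~10,000 rows: fast enough for a notebook loop

# Iterate on `sample` until the approach works, then scale back up to `rows`.
```

Because the sample is deterministic, a change in your results between iterations reflects a change in your code, not a change in your data.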

3- Observability is everything

Lord Kelvin was born two centuries ago, but one of his principles still applies well today: "If you can't measure it, you can't improve it".

By defining your metrics properly, you set the direction of the team's effort. Further, by measuring everything, you track progress toward your target and enable the team to diagnose problems along the way.
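As a sketch of "measuring everything", here is a minimal in-process metrics sink with counters and timers. The class and metric names are hypothetical; in practice you would likely emit to a real observability backend instead of keeping values in memory:

```python
import time
from collections import defaultdict

class Metrics:
    """Minimal metrics sink: count events and time operations."""

    def __init__(self):
        self.counters = defaultdict(int)
        self.timings = defaultdict(list)

    def incr(self, name, by=1):
        self.counters[name] += by

    def timeit(self, name):
        """Context manager that records elapsed seconds under `name`."""
        sink = self

        class _Timer:
            def __enter__(self):
                self.start = time.perf_counter()

            def __exit__(self, *exc):
                sink.timings[name].append(time.perf_counter() - self.start)

        return _Timer()

metrics = Metrics()
with metrics.timeit("model_fit_s"):
    sum(i * i for i in range(100_000))  # stand-in for a training step
metrics.incr("rows_processed", 100_000)
```

Instrumenting every step this way is cheap, and when a pipeline slows down or a number looks wrong, the recorded counters and timings tell you where to look first.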

4- Complexity is bad

Tech and Consulting have the resources to hire talented employees. This is generally good: talented employees love to work with equally talented employees and build great products!

However, there is one downside: talented employees love to build complex stuff! Complexity is often overlooked when the gain is palpable (or when the improvement is "fancy" - for instance, using the latest neural net configuration). Proper diligence on how much burden that complexity adds to iteration cycle time, debugging, and reproducibility is key to keeping systems lean and the team moving.

5- Naming convention is surprisingly important

Surely, naming variables is no biggie, is it? You couldn't be further from the truth! Bad variable names make code harder to read and, worse, introduce bugs!
  • Always specify units of measure: weight_kg is much better than weight
  • Be specific! Don't name your variables var or tmp (really, don't)
  • Indicating the type with a suffix doesn't hurt for booleans: for instance, is_valid_ind or is_valid_flag
  • Stick to your team conventions. For instance, 1/0 can be either coded with a boolean or with a tinyint
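The rules above can be illustrated in a few lines of Python (the values and names are invented for the example):

```python
# Bad: no units, no intent - is this kilograms? pounds? a temporary value of what?
# tmp = 72.5

# Good: units in the name, intent obvious at the call site.
weight_kg = 72.5
shipping_cost_usd = 4.99

# Booleans read naturally with an is_ prefix (or an _ind / _flag suffix,
# whichever your team convention dictates).
is_valid = weight_kg > 0

print(weight_kg, shipping_cost_usd, is_valid)
```

The `weight_kg` example in particular prevents a whole class of unit-conversion bugs: anyone adding pounds to it has to notice the mismatch in the name first.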

What are your rules? Let's have a discussion!
