<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Guillaume Guy]]></title><description><![CDATA[Guillaume Guy]]></description><link>https://www.guillaume.nyc</link><generator>RSS for Node</generator><lastBuildDate>Wed, 08 Apr 2026 20:22:42 GMT</lastBuildDate><atom:link href="https://www.guillaume.nyc/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[What is CLIP and why does it matter?]]></title><description><![CDATA[Paper: Learning Transferable Visual Models From Natural Language Supervision, Radford et al. (2021)
Introduction
As we discussed in previous posts, Contrastive Learning isn't new. For example, FAIR's Moco was published in 2019, along with many other ...]]></description><link>https://www.guillaume.nyc/what-is-clip-and-why-does-it-matter</link><guid isPermaLink="true">https://www.guillaume.nyc/what-is-clip-and-why-does-it-matter</guid><category><![CDATA[CLIP ]]></category><category><![CDATA[#multimodalai]]></category><category><![CDATA[alignment]]></category><dc:creator><![CDATA[Guillaume Guy]]></dc:creator><pubDate>Mon, 12 May 2025 14:27:26 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1747060012664/a54aed5d-b172-4715-9248-c6b4183cb17f.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><a target="_blank" href="https://arxiv.org/abs/2103.00020">Paper</a>: Learning Transferable Visual Models From Natural Language Supervision, Radford et al. (2021)</p>
<h1 id="heading-introduction">Introduction</h1>
<p>As we discussed in previous posts, Contrastive Learning isn't new. For example, FAIR's MoCo was published in 2019, along with many other papers. However, CLIP introduces something new. Let's take a step back and look at the landscape at the time of its publication.</p>
<p>Around 2021, there were two major areas of work in computer vision:</p>
<ul>
<li><p>Well-structured predictions using classes (e.g., ImageNet)</p>
</li>
<li><p>Self-Supervised Learning, which removes the need for expensive labels</p>
</li>
</ul>
<p>Meanwhile, the team at OpenAI noticed that the web had over a billion captioned images left untapped due to the challenges of handling natural language. For instance, Li et al. (2017) tried to predict phrase n-grams from photos, but the approach didn't generalize well, leading to low scores on ImageNet.</p>
<p>The question was clear: how can this web data be used to train effective, generalist computer vision models? And can we leverage both text and image modalities?</p>
<h1 id="heading-modality-alignment">Modality alignment</h1>
<p>The main claim of this seminal paper is twofold:</p>
<ul>
<li><p>One can approach the challenge as a modality alignment problem, i.e., learning a shared latent space for text and images</p>
</li>
<li><p>Trained on web data, the resulting model achieves good zero-shot performance across a variety of tasks, i.e., it generalizes well.</p>
</li>
</ul>
<p>On the first claim, the architecture is relatively simple: two encoders, one per modality. Note that there is no decoding step (i.e., generating the caption from the image). Instead, the objective is to “align” the two encoders: the picture of the Aussie dog (see below) lands in the same vicinity of the latent space as the caption “Pepper the aussie pup”. Two encoders, one shared latent space.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1746807491411/726d2a90-cf1f-4805-87a3-60b1bc84f082.png" alt class="image--center mx-auto" /></p>
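<p>The alignment objective can be sketched in a few lines of NumPy. This is a simplified, illustrative version of the symmetric contrastive loss the paper describes; random matrices stand in for the two encoders and the batch of (image, caption) pairs:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_img, d_txt, d = 8, 32, 16, 10  # batch size, encoder dims, shared latent dim

# Stand-ins for the two encoders' outputs on a batch of (image, caption) pairs
img_feats = rng.normal(size=(n, d_img))
txt_feats = rng.normal(size=(n, d_txt))

# Learned projections into the shared latent space
W_img = rng.normal(size=(d_img, d))
W_txt = rng.normal(size=(d_txt, d))

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

img_emb = l2_normalize(img_feats @ W_img)
txt_emb = l2_normalize(txt_feats @ W_txt)

# Pairwise cosine similarities, scaled by a (learned) temperature
t = 0.07
logits = img_emb @ txt_emb.T / t  # (n, n); the diagonal holds the matching pairs

def cross_entropy(logits, labels):
    # softmax cross-entropy, averaged over the batch
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

labels = np.arange(n)  # the i-th image matches the i-th caption
# Symmetric loss: classify the right caption per image, and vice versa
loss = (cross_entropy(logits, labels) + cross_entropy(logits.T, labels)) / 2
```

<p>Minimizing this loss pulls each image embedding toward its own caption's embedding and pushes it away from the other captions in the batch.</p>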
<p>On the second claim, they show that the model provides rich semantic representations and beats self-supervised methods such as SimCLRv2 on downstream classification in the zero-shot setting (i.e., without any task-specific training), which is a remarkable result. Further, with the same number of training examples, CLIP is roughly 10 percentage points more performant.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1746807807268/825f74fe-a88b-4476-a457-422c67784312.png" alt class="image--center mx-auto" /></p>
<h1 id="heading-why-does-clip-matter">Why does CLIP matter?</h1>
<p>CLIP is the first computer vision architecture to "mine the web," effectively overcoming the challenges of poor data quality. Its strength lies in its ability to generalize across a wide range of domains, making it a strong choice for your vision backbone.</p>
<p>Secondly, the photo embeddings are semantically supervised, making them suitable for giving vision to LLMs (more details in my blog post <a target="_blank" href="https://www.guillaume.nyc/how-did-llms-gain-vision">here</a>). As a matter of fact, CLIP has become the main (only?) vision backbone for LLMs.</p>
<p>Finally, it democratizes image search. By encoding all images and adding them to a vector store (e.g., <a target="_blank" href="https://github.com/facebookresearch/faiss">FAISS</a>), one can find images similar to a text or image prompt. This technique has become popular because it eliminates the need for complex infrastructure.</p>
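<p>The retrieval pattern boils down to a nearest-neighbor search over unit vectors. Here is a minimal sketch with random vectors standing in for CLIP embeddings (a real system would encode the library with the image encoder and the query with the text encoder):</p>

```python
import numpy as np

rng = np.random.default_rng(1)
d = 512  # e.g. the embedding size of CLIP ViT-B/32

# Pretend these came from the image encoder over a photo library
library = rng.normal(size=(1000, d))
library /= np.linalg.norm(library, axis=1, keepdims=True)

# Pretend this came from the text encoder; by construction it is close to item 42
query = library[42] + 0.1 * rng.normal(size=d)
query /= np.linalg.norm(query)

# Cosine similarity = dot product of unit vectors; keep the 5 best matches
scores = library @ query
top5 = np.argsort(-scores)[:5]
```

<p>A vector store like FAISS does exactly this, with indexes that avoid the brute-force scan.</p>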
<p>It is no surprise that AI labs continue to invest heavily in CLIP-style learning. For instance, Google recently released <a target="_blank" href="https://arxiv.org/abs/2502.14786">SIGLIP-2</a>, which incorporates additional losses to address some weaknesses. For example, CLIP is not well-suited for segmentation and depth estimation tasks, so the authors added <a target="_blank" href="https://arxiv.org/abs/2403.19596">LocCa</a> tasks. FAIR expanded to many modalities with <a target="_blank" href="https://ai.meta.com/research/publications/imagebind-one-embedding-space-to-bind-them-all/">ImageBind</a>.</p>
<h1 id="heading-conclusion">Conclusion</h1>
<p>Go check it out on <a target="_blank" href="https://huggingface.co/blog/siglip2#encode-images-for-downstream-tasks">HuggingFace</a>. SigLIP-2 is open weights and released under the permissive Apache 2.0 license.</p>
]]></content:encoded></item><item><title><![CDATA[How did LLMs gain Vision?]]></title><description><![CDATA[Introduction
LLMs were created as text-only chat bots as an artifact of their training paradigm: they learn to predict the next token (~word). Photos, videos and other richer media don’t have words and therefore were naturally excluded from the train...]]></description><link>https://www.guillaume.nyc/how-did-llms-gain-vision</link><guid isPermaLink="true">https://www.guillaume.nyc/how-did-llms-gain-vision</guid><category><![CDATA[llm]]></category><category><![CDATA[CLIP ]]></category><category><![CDATA[Vision]]></category><dc:creator><![CDATA[Guillaume Guy]]></dc:creator><pubDate>Mon, 05 May 2025 15:11:09 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1746457907765/c19ca555-84a2-4893-b8b3-534f4aea78ad.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="heading-introduction">Introduction</h1>
<p>LLMs were created as text-only chatbots, an artifact of their training paradigm: they learn to predict the next token (~word). Photos, videos, and other richer media don’t have words and were therefore naturally excluded from the training protocol.</p>
<p>However, most LLMs are now multimodal. How did that happen?</p>
<h1 id="heading-image-as-cross-encoded-signals">Image as Cross-Encoded signals</h1>
<p><a target="_blank" href="https://arxiv.org/abs/2204.14198">Paper</a></p>
<p>One of the early papers to explore multimodality is DeepMind’s Flamingo from 2022. Their technique consisted of interleaving text and photos and having the LLM complete the sentence by following the pattern.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1746200797427/5a89d7b5-5da9-4ab2-8902-97f4d4e2f6c7.png" alt class="image--center mx-auto" /></p>
<p>Their architecture is relatively simple and can be summarized in 3 points (*):</p>
<ul>
<li><p>Keep the Vision Encoder and the language model frozen. The implicit assumption is that both are “multimodal-ready”, since both are semantically supervised; the only missing piece to make the system work is the “bridge” between the two “languages”. This setup greatly reduces the number of learned parameters.</p>
</li>
<li><p>Add a &lt;image&gt; (learnt) token to represent the location of the n-th photo.</p>
</li>
<li><p>For each layer of the LM, add a cross-attention to attend to the photo(s) representation</p>
</li>
</ul>
<p><em>(*) leaving out the Perceiver Resampler, which is an interesting component but not central to the architecture. More details in the paper.</em></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1746200852328/66aae659-5573-4eeb-9006-47b56c317372.png" alt class="image--center mx-auto" /></p>
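<p>The “bridge” of the third bullet can be sketched as a single cross-attention step in NumPy, with random matrices standing in for the frozen backbones (single head, Perceiver Resampler omitted). One real detail worth showing: Flamingo gates the new pathway with tanh of a scalar initialized at zero, so training starts from the unmodified language model:</p>

```python
import numpy as np

rng = np.random.default_rng(2)
T, V, d = 6, 4, 32  # text tokens, visual tokens, model dim

text = rng.normal(size=(T, d))    # hidden states of the (frozen) language model
visual = rng.normal(size=(V, d))  # output of the (frozen) vision encoder

# Learned projections of the cross-attention "bridge"
Wq, Wk, Wv = (rng.normal(size=(d, d)) * d**-0.5 for _ in range(3))

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

# Text queries attend to visual keys/values
attn = softmax((text @ Wq) @ (visual @ Wk).T / np.sqrt(d))  # (T, V)
cross = attn @ (visual @ Wv)

# Gated residual: tanh(0) = 0, so at initialization the LM is untouched
alpha = 0.0
out = text + np.tanh(alpha) * cross
```

<p>As alpha is learned, the visual pathway is blended in gradually, layer by layer.</p>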
<h1 id="heading-image-as-a-dense-token">Image as a dense token</h1>
<p><a target="_blank" href="https://arxiv.org/pdf/2407.07726">Paper</a></p>
<p>PaliGemma, a recent 3B model released by Google in 2024, shows another way to integrate vision signals. It uses the output from the Image Encoder as dense tokens, which can be combined with text tokens after a projection. This way, the LLM encounters new dense tokens that aren't part of the token dictionary and can handle them just like text tokens.</p>
<p><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/paligemma/paligemma_arch.png" alt="Architecture" /></p>
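<p>The dense-token idea fits in a few lines. The sizes below are illustrative (not taken from the paper), and a random matrix stands in for both encoders and the learned projection:</p>

```python
import numpy as np

rng = np.random.default_rng(3)
n_patches, d_vision, d_model = 256, 1152, 2048  # illustrative sizes

# Output of the image encoder: one embedding per patch
patch_emb = rng.normal(size=(n_patches, d_vision))

# The linear projection that maps patches into the LM's embedding width
W_proj = rng.normal(size=(d_vision, d_model)) * d_vision**-0.5
image_tokens = patch_emb @ W_proj

# Stand-in for the embedded text prompt, e.g. "describe this image"
text_tokens = rng.normal(size=(10, d_model))

# The LM simply sees one sequence: image tokens first, then text tokens
sequence = np.concatenate([image_tokens, text_tokens], axis=0)
```

<p>From the LM's perspective, the image tokens are just more embeddings, even though they never came from the token dictionary.</p>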
<p>The paper is very accessible and full of useful details for replication. Specifically, the authors include a clear, step-by-step recipe:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1746202662044/ffc14c5b-7912-4ee8-bfb4-cf8835aaa625.png" alt class="image--center mx-auto" /></p>
<p>The above architecture is interesting in several ways:</p>
<ul>
<li><p>First, it eliminates the cross-attention system, which is computationally demanding because it is added to each layer of the LLM.</p>
</li>
<li><p>Instead, the authors use a simple "Linear Projection," a technique previously used by <a target="_blank" href="https://arxiv.org/pdf/2310.03744">LLaVA</a>.</p>
</li>
<li><p>The authors allowed the Image Encoder to be fine-tuned, stating that it improves performance in the "blind spot" of its contrastive pretraining, specifically in relation and localization (*).</p>
</li>
</ul>
<p>(*) It would be interesting to see if this remains true, as SigLIP-2 tried to address these blind spots by adding extra losses to capture these tasks.</p>
<h1 id="heading-what-matters-when-training-a-vlm">What matters when training a VLM?</h1>
<p><a target="_blank" href="https://arxiv.org/pdf/2405.02246">paper</a></p>
<p>In 2024, the team at HuggingFace aimed to find out what matters when tuning a VLM. They reached the following conclusions:</p>
<ul>
<li><p>Enhancements in pretraining (for LLM and vision encoders) result in better performance in later tasks (Finding #1).</p>
</li>
<li><p>Cross-attention generally provides better performance than fully auto-regressive models, but it requires 10% more computing power. However, this difference disappears when using LoRA adapters (Finding #2).</p>
</li>
</ul>
<p>The paper also contains many other insights, such as the link between image splitting and performance on reading tasks.</p>
<h1 id="heading-conclusion">Conclusion</h1>
<p>We compared two major early architectures for integrating vision into LLMs that have been popular up to 2024. With "native LLMs," we expect to see more co-trained multimodal models in the future. Last year, GPT-4o pioneered this approach, but the architecture and training details are still unclear. One <a target="_blank" href="https://www.reddit.com/r/MachineLearning/comments/1crzdhd/comment/l42cqhd/?utm_source=share&amp;utm_medium=web3x&amp;utm_name=web3xcss&amp;utm_term=1&amp;utm_content=share_button">hypothesis</a> supported by Gwern suggests it uses a <a target="_blank" href="https://arxiv.org/abs/1711.00937">VQ-VAE</a> to tokenize photos and integrate them into their token dictionary. With interleaved data, the LLM's autoregressive training can be applied "natively," meaning the LLM will need to predict tokens in either modality. Only time will tell.</p>
]]></content:encoded></item><item><title><![CDATA[The state of Self-Supervision for Vision]]></title><description><![CDATA[Introduction
To perform vision tasks effectively, it's important to have a strong, general-purpose vision backbone. This allows you to handle many vision tasks, such as:

Image-to-image similarity: For comparison or retrieval.

Vision adapters: Add a...]]></description><link>https://www.guillaume.nyc/the-state-of-self-supervision-for-vision</link><guid isPermaLink="true">https://www.guillaume.nyc/the-state-of-self-supervision-for-vision</guid><category><![CDATA[vision backbone]]></category><category><![CDATA[self supervised learning]]></category><category><![CDATA[Vision]]></category><dc:creator><![CDATA[Guillaume Guy]]></dc:creator><pubDate>Thu, 01 May 2025 15:38:52 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1745854244095/ecf5a59a-c51a-4b6b-8fff-bd71bed8eb64.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="heading-introduction">Introduction</h1>
<p>To perform vision tasks effectively, it's important to have a strong, general-purpose vision backbone. This allows you to handle many vision tasks, such as:</p>
<ul>
<li><p><strong>Image-to-image similarity:</strong> For comparison or retrieval.</p>
</li>
<li><p><strong>Vision adapters:</strong> Add a classification head to solve specific problems (e.g., cat vs. dog classification).</p>
</li>
<li><p><strong>Fine-tuning:</strong> Adapt the backbone to a specific problem (similar to adapters, usually with higher accuracy but requires recompiling the entire backbone).</p>
</li>
<li><p><strong>Giving vision to LLM:</strong> By feeding the vision hidden state to the LLM (see <a target="_blank" href="https://huggingface.co/blog/paligemma">PaliGemma</a> as an example), the LLM can learn to interpret images.</p>
</li>
</ul>
<p>The backbone serves as a starting point for downstream activities. The better the backbone, the better the performance on these tasks.</p>
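<p>The “vision adapter” pattern from the list above can be sketched as a linear probe: freeze the backbone, train only a classification head on its features. Random features with a planted linear signal stand in for a real backbone here:</p>

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, n_classes = 200, 64, 2  # samples, backbone feature dim, cat vs dog

# Stand-in for frozen backbone outputs; labels depend linearly on one feature
features = rng.normal(size=(n, d))
labels = (features[:, 0] > 0).astype(int)

# Classification head: a single weight matrix trained by gradient descent
W = np.zeros((d, n_classes))
for _ in range(300):
    logits = features @ W
    p = np.exp(logits - logits.max(1, keepdims=True))
    p /= p.sum(1, keepdims=True)
    p[np.arange(n), labels] -= 1.0      # d(loss)/d(logits) for cross-entropy
    W -= 0.1 * features.T @ p / n       # only the head moves; backbone is frozen

accuracy = ((features @ W).argmax(1) == labels).mean()
```

<p>Fine-tuning is the same picture, except the backbone weights also receive gradients.</p>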
<p>Currently, there's a lot of excitement about CLIP, which is a method to semantically supervise the hidden state (using "words"). It tends to deliver great performance, easier integration with LLMs (since LLMs are semantic machines), and zero-shot capabilities (allowing image retrieval based on text-based queries). I will write a post on this topic later, but for now, let's discuss self-supervision (usually shortened to SSL).</p>
<h1 id="heading-selective-history-of-self-supervision">Selective history of Self-supervision</h1>
<p>Let’s investigate a few seminal papers to see where we’re coming from.</p>
<h2 id="heading-1-unsupervised-visual-representation-learning-by-context-prediction-2015"><strong>1- Unsupervised Visual Representation Learning by Context Prediction, 2015</strong></h2>
<p><a target="_blank" href="https://arxiv.org/pdf/1505.05192">Paper</a></p>
<p>The authors (Doersch et al.) formulated the self-supervision task as a puzzle: split the image into patches and have the model predict where a given patch sits relative to a reference patch. This is called a “<em>pretext task</em>” (meaning it’s not the overarching objective of the model, but rather a way to force the model to train on a task that will be useful for other applications).</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745855785921/fbfba9ee-a194-4846-b158-c968ce2e533e.png" alt class="image--center mx-auto" /></p>
<p>The model is of a <a target="_blank" href="https://en.wikipedia.org/wiki/Siamese_neural_network">Siamese type</a> as it takes in 2 patches, processes them (with the same weights) and has a late fusion at the end to compute the predicted location of the 2nd patch.</p>
<p>The most fascinating finding of this paper is that the model, despite having no semantic information about the photos or patches, learns <em>mid-level semantic structures</em> (e.g., distinguishing objects and surfaces) even though the training signal is purely geometric. For instance, the below figure shows an uncanny ability to discern fixtures in a city. We can surmise that the model builds its own internal conceptual representation of the world.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745856157931/caad45f0-2e83-435d-92fe-a83de66bd531.png" alt class="image--center mx-auto" /></p>
<p>Finally, the authors show that these trained models can be helpful for many downstream applications such as object detection and geometry estimation (section 4).</p>
<p>However, because the task is local (patch-level), it can bias the model toward capturing only short-range dependencies rather than full object understanding.</p>
<p><strong>Top 1 accuracy on ImageNet-1K</strong>: 51.4% @ 100M params (<a target="_blank" href="https://arxiv.org/pdf/1911.05722">source</a>)</p>
<h2 id="heading-2-moco-momentum-contrast-for-unsupervised-visual-representation-learning-2020">2- MOCO: Momentum Contrast for Unsupervised Visual Representation Learning, 2020</h2>
<p><a target="_blank" href="https://arxiv.org/pdf/1911.05722">Paper</a>, <a target="_blank" href="https://github.com/facebookresearch/moco">code</a></p>
<p>MoCo’s core improvement is in the realm of “<em>hard negative mining</em>” strategies, which aim to select negatives that maximize learning efficiency.</p>
<p>To improve contrastive learning, MoCo brings in two innovations:</p>
<ul>
<li><p><strong>Momentum Encoder:</strong> A slow-moving copy of the query encoder, updated via exponential moving average, ensuring the encoded keys evolve smoothly rather than changing rapidly.</p>
</li>
<li><p><strong>Queue:</strong> A dynamically maintained FIFO queue of previous embeddings (keys), allowing the dictionary of negatives to be much larger than the minibatch size without prohibitive memory cost. The queue resembles a memory bank but is simpler and fresher: it maintains only a limited number of recent embeddings, without associating them to specific dataset indices.</p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745868719151/f1c204bc-653c-4559-ba0c-b04f503ed7fc.png" alt class="image--center mx-auto" /></p>
<p>So, let’s sum it up. The model needs positives and negatives:</p>
<ul>
<li><p><strong>Positives</strong>: A positive pair consists of two random augmentations of the same image: one passed through the query encoder (actively updated) and one through the momentum encoder (slow-moving copy).</p>
</li>
<li><p><strong>Negatives</strong>: Anything in the queue</p>
</li>
</ul>
<p>What’s the current challenge?</p>
<ul>
<li>Traditional end-to-end contrastive learning depends on large batch sizes (to supply many negatives), which quickly run into GPU memory limits.</li>
</ul>
<p>What’s MoCo’s contribution?</p>
<ul>
<li>By decoupling the dictionary size from the minibatch size (via a queue) and ensuring representation consistency (via a momentum encoder), MoCo replicates and exceeds the performance of large-batch end-to-end contrastive learning — scaling up to queues with 65k negatives.</li>
</ul>
<p>Very cool!</p>
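<p>The two mechanisms are compact enough to sketch directly. Below, a single weight matrix stands in for each encoder, and the contrastive loss and real gradient step are replaced by a placeholder update; what matters is the momentum rule and the FIFO queue:</p>

```python
import numpy as np
from collections import deque

rng = np.random.default_rng(6)
d, K, m = 16, 64, 0.999  # embedding dim, queue size, momentum coefficient

# Toy "encoders": the key encoder starts as a copy of the query encoder
W_q = rng.normal(size=(d, d))
W_k = W_q.copy()

queue = deque(maxlen=K)  # FIFO dictionary of negative keys

for step in range(10):
    x = rng.normal(size=(8, d))            # a minibatch (augmentations omitted)
    q = x @ W_q                            # queries, from the trained encoder
    k = x @ W_k                            # keys, from the momentum encoder
    # ... contrastive loss of q against k and `queue` would drive this update:
    W_q += 0.01 * rng.normal(size=(d, d))  # stand-in for the SGD step
    # Momentum update: keys evolve smoothly; no gradient flows into W_k
    W_k = m * W_k + (1 - m) * W_q
    queue.extend(k)                        # enqueue newest keys; oldest fall out

n_negatives = len(queue)  # bounded by K, independent of the batch size
```

<p>The queue decouples the number of negatives from the minibatch size, and the momentum update keeps those stored keys consistent with the current encoder.</p>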
<p><strong>Top 1 accuracy on ImageNet-1K</strong>: 65.4% @ 100M params (<a target="_blank" href="https://arxiv.org/pdf/1911.05722">source</a>)</p>
<h2 id="heading-3-simclr-2020">3- SimCLR, 2020</h2>
<p><a target="_blank" href="https://arxiv.org/pdf/2002.05709">Paper</a></p>
<p>Some papers achieve higher performance by adding complexity; the best ones achieve both higher performance and <em>lower</em> complexity. SimCLR belongs to the latter: it discards specialized components like memory banks and momentum encoders, offering a simpler, more scalable framework for contrastive learning.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1746112963795/897146b4-33ac-4fb7-ba51-a310794c7515.png" alt class="image--center mx-auto" /></p>
<blockquote>
<p>Figure: Overall architecture of the SimCLR contrastive learning framework</p>
</blockquote>
<p>What is SimCLR’s contribution?</p>
<ul>
<li><p>First, it introduces a <strong>simplified contrastive learning pipeline</strong>: given an image, two independent but carefully selected augmentations are applied to create a positive pair. A single model (no second momentum encoder) is trained to distinguish this pair from all others in a large batch (typically 8k examples), treating all other examples as negatives (i.e. contrastive learning).</p>
</li>
<li><p>To further improve learned representations, SimCLR <strong>decouples the encoder</strong> from a lightweight <strong>projection head</strong> used only during training. The contrastive loss is applied to the outputs of this projection head, allowing the encoder to retain richer, more transferable features. This architectural tweak significantly boosts linear evaluation performance (e.g., from ~50% to ~65% Top-1 on ImageNet-1K).</p>
</li>
<li><p>Critically, SimCLR <strong>systematically studies the role of data augmentation</strong>. They find that augmentation strategies are not equally effective: combinations like random cropping plus color jittering dramatically outperform others (e.g., crop+color achieving 56% vs Sobel+rotate yielding only 4%).</p>
</li>
</ul>
<p>Let’s dive into the set of augmentations mentioned above. The table below analyzes different augmentation compositions and shows that they are not all equal!</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745938644518/1ec33abb-f171-46a1-8998-83b3a56d6319.png" alt class="image--center mx-auto" /></p>
<p>Overall, SimCLR delivers a substantial performance improvement — almost <strong>10 percentage points</strong> over previous contrastive methods — while simplifying the overall framework.</p>
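<p>The simplified pipeline can be sketched as SimCLR's NT-Xent loss over a batch of 2n views (two per image). Adding a small perturbation stands in for the second augmentation here:</p>

```python
import numpy as np

rng = np.random.default_rng(7)
n, d, tau = 4, 8, 0.5  # images per batch, projection dim, temperature

# Two augmented views per image, after the encoder + projection head
z1 = rng.normal(size=(n, d))
z2 = z1 + 0.1 * rng.normal(size=(n, d))  # stand-in for the second augmentation
z = np.concatenate([z1, z2])
z /= np.linalg.norm(z, axis=1, keepdims=True)

sim = z @ z.T / tau                # (2n, 2n) scaled cosine similarities
np.fill_diagonal(sim, -np.inf)     # never contrast an example with itself

# For row i, the positive is the other view of the same image;
# every other row in the batch is a negative
pos = np.concatenate([np.arange(n, 2*n), np.arange(n)])
log_probs = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
loss = -log_probs[np.arange(2*n), pos].mean()
```

<p>No memory bank, no momentum encoder: the batch itself supplies the negatives, which is why SimCLR needs such large batches.</p>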
<p><strong>Top 1 accuracy on ImageNet-1K</strong>: 69% @ 100M params (<a target="_blank" href="https://arxiv.org/pdf/2002.05709">source</a>)</p>
<h2 id="heading-4-dino-2021">4- Dino, <em>2021</em></h2>
<p><a target="_blank" href="https://arxiv.org/pdf/2104.14294">Paper</a>, <a target="_blank" href="https://github.com/facebookresearch/dino">code</a></p>
<p>Finally, let’s talk about Dino from FAIR.</p>
<p>DINO approaches the problem of Self-Supervision (SSL) as a distillation problem, i.e., having two models (teacher + student) where the student learns from the teacher. This paradigm is common in the LLM space, where the teacher is a much larger, pretrained model that needs to be shrunk for inference-efficiency purposes (denoted “mini” in OpenAI parlance). In DINO, however, the teacher starts from scratch too and improves over time. A more descriptive term would therefore be "pair-learning with a centered partner".</p>
<p>Ok, let’s talk about the DINO’s contributions now:</p>
<ul>
<li><p><strong>Self Supervision as a Distillation problem</strong>: The student’s role is to match the teacher’s distribution of scores. The teacher also learns, through a momentum update (its weights are an exponential moving average of the student’s).</p>
</li>
<li><p><strong>Avoiding collapse without using a contrastive loss.</strong> Instead, they “center and sharpen” the teacher’s outputs. DINO found a surprisingly simple rule: <strong>center</strong> (subtract a moving average in the latent space) to prevent one dimension from dominating, and <strong>sharpen</strong> (softmax with a low temperature) to avoid a uniform distribution.</p>
</li>
</ul>
<p>Now, how does the teacher “learn”? Its weights are actually an exponential moving average of the student’s (no SGD update).</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1746111696013/33b79410-8eba-46b1-8958-a504093b9bb8.png" alt class="image--center mx-auto" /></p>
<blockquote>
<p>Figure: Diagram of the DINO training protocol</p>
</blockquote>
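<p>Centering, sharpening, and the EMA update can be sketched together. A single weight matrix stands in for each network, and a random perturbation stands in for the student's SGD step (the temperatures are illustrative, in the range the paper uses):</p>

```python
import numpy as np

rng = np.random.default_rng(8)
d_out, m = 16, 0.996  # output dimensions, teacher momentum

def log_softmax(x, temp):
    x = x / temp
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

student_w = rng.normal(size=(32, d_out))
teacher_w = student_w.copy()  # teacher starts as a copy of the student
center = np.zeros(d_out)

for step in range(5):
    x = rng.normal(size=(8, 32))  # in practice, two augmented views are used
    s_logits = x @ student_w
    t_logits = x @ teacher_w
    # Teacher target: center (no dimension dominates), then sharpen
    # (low temperature, so the distribution cannot stay uniform)
    t_probs = np.exp(log_softmax(t_logits - center, temp=0.04))
    s_logp = log_softmax(s_logits, temp=0.1)
    loss = -(t_probs * s_logp).sum(axis=1).mean()  # student matches the teacher
    student_w -= 0.01 * rng.normal(size=student_w.shape)  # stand-in for SGD
    # The teacher never sees a gradient: EMA of the student + a running center
    teacher_w = m * teacher_w + (1 - m) * student_w
    center = 0.9 * center + 0.1 * t_logits.mean(axis=0)
```

<p>Centering and sharpening push in opposite directions, and their balance is what keeps the representations from collapsing.</p>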
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1746111874215/6d575623-0945-4dec-bfc3-7cf27a4735ce.png" alt class="image--center mx-auto" /></p>
<p>One interesting aspect is that the teacher always keeps an edge over the student, as shown in the figure above. The reason might be that the student’s role is to explore while the teacher keeps the student “on the right track” (through the momentum mechanism).</p>
<p><strong>Top 1 accuracy on ImageNet-1K</strong>: 75% @ 23M params, ResNet-50 (<a target="_blank" href="https://arxiv.org/pdf/2104.14294">source</a>)</p>
<h1 id="heading-conclusion">Conclusion</h1>
<p>Doing a large paper review over a long time range always brings an interesting perspective on the progression of the field, as the high-level view sorts the important from the unimportant. Looking back at these papers, a few key innovations strike me as important:</p>
<ul>
<li><p>At the highest level, SSL + classification has kept pace with fully supervised learning. This is a major advantage, as SSL only requires the photos, without the need for expensive and slow labels. Yet, to my knowledge, only a few companies have applied these techniques.</p>
</li>
<li><p>The Teacher+Student paradigm works stunningly well as an SSL technique.</p>
</li>
<li><p>All reviewed authors take special care in their data augmentation. This aspect is critical to creating good visual representation.</p>
</li>
<li><p>Momentum encoders, used as teachers in MoCo and DINO, are a simple yet useful construct to guide a more eager encoder. This concept may well expand beyond SSL.</p>
</li>
</ul>
<p>Can you represent some of your problems as Self-Supervised Learning tasks?</p>
]]></content:encoded></item><item><title><![CDATA[Don't use raw embeddings]]></title><description><![CDATA[Introduction:
With the rise of Transformers, embeddings are now widely used:

As representations of images or texts that can be used by other models or in a zero-shot manner

As a basic building block for Vector Search in LLM RAG and image search


H...]]></description><link>https://www.guillaume.nyc/dont-use-raw-embeddings</link><guid isPermaLink="true">https://www.guillaume.nyc/dont-use-raw-embeddings</guid><category><![CDATA[vector quantization]]></category><category><![CDATA[vector embeddings]]></category><dc:creator><![CDATA[Guillaume Guy]]></dc:creator><pubDate>Tue, 15 Apr 2025 20:51:13 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1745420938501/e976eec4-aab3-490b-818b-26ee4130de74.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="heading-introduction">Introduction:</h1>
<p>With the rise of Transformers, embeddings are now widely used:</p>
<ul>
<li><p>As representations of images or texts that can be used by other models or in a zero-shot manner</p>
</li>
<li><p>As a basic building block for Vector Search in LLM RAG and image search</p>
</li>
</ul>
<p>However, embeddings are still quite large. OpenAI’s <code>text-embedding-3-large</code> can reach up to d=3072, which means 12 kB (stored as float32) per entity. From experience, this is enough to overwhelm SQL engines when performing large JOINs, as this data needs to be sent across the network for a distributed JOIN.</p>
<p>Therefore, it makes sense to try compressing these embeddings into a smaller, yet high-quality, representation.</p>
<h1 id="heading-vector-quantization">Vector Quantization:</h1>
<p>Vector Search has been around for a while but became truly popular in 2022 (see <a target="_blank" href="https://trends.google.com/trends/explore?date=all&amp;q=vector%20search&amp;hl=en-US">Trend</a> below). However, with much foresight, Facebook released its vector search codebase <a target="_blank" href="https://engineering.fb.com/2017/03/29/data-infrastructure/faiss-a-library-for-efficient-similarity-search/">FAISS</a> in 2017.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1744741292022/85a4f415-b5fb-4a33-b385-2e540d4736b8.png" alt class="image--center mx-auto" /></p>
<p>A common challenge with vector search is storing all vectors in memory, which can be quite costly when the vectors are large. H. Jégou et al. (<a target="_blank" href="https://ieeexplore.ieee.org/document/5432202">paper</a>) introduced Product Quantization in 2010. The main idea is to divide a long vector into smaller chunks (about 4 dimensions each) and apply k-means clustering to each chunk position. Each chunk is then represented by the index of its closest centroid. With enough centroids, the loss is expected to be minimal.</p>
<p>The illustration below shows how this works. It displays 2 chunks of 4 columns and their closest centroids (125 and 12) for encoding.</p>
<p>To decode at runtime, the vector is reconstructed by looking up the coordinates of the centroids and combining them back into the original vector space.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1744741832354/ed411adb-91dd-4264-b661-203c2e2d99dd.png" alt class="image--center mx-auto" /></p>
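<p>The encode/decode cycle fits in a short NumPy sketch. The sizes below are deliberately tiny (8 centroids over 4-dimensional chunks, versus 256 centroids in the calculation that follows), and the k-means "training" is a few bare Lloyd iterations:</p>

```python
import numpy as np

rng = np.random.default_rng(9)
d, chunk, n_centroids = 16, 4, 8  # toy sizes; the post uses d=3072, 256 centroids

vectors = rng.normal(size=(500, d))
n_chunks = d // chunk

# "Train": one k-means codebook per chunk position (a few Lloyd iterations)
codebooks = []
for c in range(n_chunks):
    sub = vectors[:, c*chunk:(c+1)*chunk]
    centroids = sub[rng.choice(len(sub), n_centroids, replace=False)]
    for _ in range(10):
        assign = np.linalg.norm(sub[:, None] - centroids[None], axis=2).argmin(1)
        for k in range(n_centroids):
            if (assign == k).any():
                centroids[k] = sub[assign == k].mean(0)
    codebooks.append(centroids)

def encode(v):
    # Each chunk becomes the uint8 index of its nearest centroid
    return np.array(
        [np.linalg.norm(codebooks[c] - v[c*chunk:(c+1)*chunk], axis=1).argmin()
         for c in range(n_chunks)], dtype=np.uint8)

def decode(codes):
    # Look the centroids back up and concatenate into the original space
    return np.concatenate([codebooks[c][int(codes[c])] for c in range(n_chunks)])

codes = encode(vectors[0])   # 4 bytes instead of 16 float32s
approx = decode(codes)       # lossy reconstruction of vectors[0]
```

<p>Storage drops from d float32s to d/chunk bytes, at the cost of a small, tunable reconstruction error.</p>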
<p>You can also calculate the space saved:</p>
<ul>
<li><p><strong>Current</strong>: 3072 dimensions at float32 = 12 kB</p>
</li>
<li><p><strong>Product Quantized</strong> (dim=4, with 256 centroids stored as uint8): (3072 / 4) * 1 byte = 768 B (16x smaller)</p>
</li>
</ul>
<p>Product Quantization (PQ) is a simple yet effective technique to save space. Although there are other methods, PQ is still widely used in the industry.</p>
<p>To illustrate, we implemented it in a few lines of NumPy code in this <a target="_blank" href="https://gist.github.com/guillaumeguy/3e5555cc5134ad742e2558792bef0808">gist</a>.</p>
<h1 id="heading-cpugpu-friendliness">CPU/GPU Friendliness:</h1>
<p>A keen observer will notice the following:</p>
<ul>
<li><p>Each group of columns can be processed in parallel, making computation highly parallel.</p>
</li>
<li><p>The operations are basic: matrix multiplication and lookups (i.e. <a target="_blank" href="https://en.wikipedia.org/wiki/Gather/scatter_\(vector_addressing\)#Definitions">gather</a>), which are highly optimized on both CPUs and GPUs.</p>
</li>
</ul>
<p>This makes Product Quantization efficient on almost all modern hardware.</p>
<h1 id="heading-going-further">Going Further:</h1>
<p>T. Ge et al. (CVPR 2013) (<a target="_blank" href="https://people.csail.mit.edu/kaiming/publications/pami13opq.pdf">link</a>) improved PQ by adding a rotation step, which significantly reduces the loss. From experience, a cosine similarity above 99.X% between the original and reconstructed vectors enables most downstream use cases.</p>
<h1 id="heading-takeaway">Takeaway:</h1>
<p>While embeddings are useful for many applications, they are often large and hard to manage. By quantizing them and using the codes instead of the raw embeddings, you can improve their usability while keeping them close to the original vector. Check it out!</p>
]]></content:encoded></item><item><title><![CDATA[Focus: Shapley value]]></title><description><![CDATA[Game theory is a fascinating topic codifying and quantifying all sorts of interactions between stakeholders in a game. The most popular setup is the prisoner's dilemma but there is much more to it. Today, we will cover the Shapley value as I recently...]]></description><link>https://www.guillaume.nyc/focus-shapley-value</link><guid isPermaLink="true">https://www.guillaume.nyc/focus-shapley-value</guid><category><![CDATA[Shapley value]]></category><dc:creator><![CDATA[Guillaume Guy]]></dc:creator><pubDate>Tue, 15 Apr 2025 16:22:31 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1745421387504/1a9871f9-9f33-4429-bb18-eeb9e479b99a.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Game theory is a fascinating topic codifying and quantifying all sorts of interactions between stakeholders in a game. The most popular setup is the prisoner's dilemma but there is much more to it. Today, we will cover the <strong>Shapley value</strong> as I recently stumbled across this original yet relatively unknown concept.</p>
<h2 id="heading-problem-at-stake"><strong>Problem at stake</strong>:</h2>
<p><strong>Example 1:</strong> In a salary negotiation, the employee presents their skills and what they bring to the company. But how much are these skills worth?</p>
<p><strong>Example 2:</strong> In a joint venture, each founding company contributes expertise. What’s a fair way to distribute ownership or shares?</p>
<p><strong>Example 3:</strong> Two or three telecom companies want to build a fiber network that would benefit all. What’s a fair way to split the costs?</p>
<p>When we think about these problems, most of us might guess: “You should ask for an X% raise because you deserve it,” but there is actually a theory for this.</p>
<p>Let’s explore the theory behind it.</p>
<h2 id="heading-solution"><strong>Solution</strong>:</h2>
<p>This is known as a cooperative game, and there is exactly one breakdown function that meets a few fairness conditions (more on this later).</p>
<p>The main idea: “Player A's fair reward is the average of their marginal contributions to the different coalitions leading to the final setup,” where:</p>
<p>For a game with 3 players (A, B, C), we define:</p>
<ul>
<li><p>Final setup: the final set of stakeholders S {A, B, C}</p>
</li>
<li><p>Coalition: a subset of S</p>
</li>
<li><p>Marginal contribution: the contribution of adding A to {B, C} is Value {A, B, C} – Value {B, C}</p>
</li>
</ul>
<p>Now, let’s introduce some values:</p>
<ul>
<li><p>V(A) = 12</p>
</li>
<li><p>V(B) = 8</p>
</li>
<li><p>V(C) = 2</p>
</li>
<li><p>V(A, B) = 22</p>
</li>
<li><p>V(A, C) = 15</p>
</li>
<li><p>V(B, C) = 11</p>
</li>
<li><p>V(A, B, C) = 23</p>
</li>
</ul>
<p><img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhiCHDcQfKDDZQImKzsxc7YN2bCaX7HBnYEpdnqpExA-_e4P5dDZ2KooFKjnn3OhmKnolV9aVHtkK5uRYN6p8l9SW-IA1F0nr0hAKzUj7GNaZNaQZWs1tlGhMPEixTylGm8o8ps6qXqsuLy/s640/page1.PNG" alt /></p>
<p>The Shapley value of A is calculated as:</p>
<p><img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiC7ES0CUuVJIIB15Wgh1ovE1tOr-e2ZeItsKIbX62tVYtBCzSkkMYILbIZtrCNMx1Fdj-KXDxlXYKg574Y1aHxKesdv1ddIKfHrgZiBMmjonQIfEWzSX_ACyWZpKjUoEgpbyuJr8qOzYzs/s640/page2.PNG" alt /></p>
<h2 id="heading-generalization">Generalization</h2>
<p>From there, you can intuit the general formula, where n! is the total number of permutations:</p>
<p><img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhOyHJKMJ_SEL1dBrlr-T3R4Yl-gq_8_roLEAAhq-3XT43rqF1YxldSH2NzMOXNJ5VsRJ1H17TE7UEWJw54W2yR7vlQ7Jug-9NzTiJ7XeYBHxxC9aX-fWU6PJj3k3vXQ7gPFYiAuTmpFrY5/s400/formula.PNG" alt /></p>
<p>Where K \ A denotes the coalition K without A.</p>
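<p>The averaging over join orders is easy to verify in code. Here is a minimal sketch using the example values above (brute force over all n! orderings, so only practical for small n):</p>

```python
from itertools import permutations

# Coalition values from the example above (empty coalition is worth 0)
V = {frozenset(): 0, frozenset('A'): 12, frozenset('B'): 8, frozenset('C'): 2,
     frozenset('AB'): 22, frozenset('AC'): 15, frozenset('BC'): 11,
     frozenset('ABC'): 23}

def shapley(players):
    """Average each player's marginal contribution over all join orders."""
    phi = {p: 0.0 for p in players}
    orders = list(permutations(players))
    for order in orders:
        coalition = frozenset()
        for p in order:
            # marginal contribution of p joining the current coalition
            phi[p] += V[coalition | {p}] - V[coalition]
            coalition = coalition | {p}
    return {p: total / len(orders) for p, total in phi.items()}
```

<p>For the values above this yields 12.5 for A, 8.5 for B, and 2 for C, and the three shares sum to V(A, B, C) = 23, which is the efficiency property of the Shapley value.</p>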
<p>What current problem can you apply the Shapley value to?</p>
]]></content:encoded></item><item><title><![CDATA[Branch Misprediction in Humans]]></title><description><![CDATA[Branch prediction refers to a CPU's ability to predict the outcome of a boolean statement to streamline processing. This concept parallels human decision-making, where people continuously assess potential outcomes based on experience and inference, o...]]></description><link>https://www.guillaume.nyc/branch-misprediction-in-humans</link><guid isPermaLink="true">https://www.guillaume.nyc/branch-misprediction-in-humans</guid><category><![CDATA[branch prediction]]></category><category><![CDATA[human]]></category><category><![CDATA[cpu]]></category><dc:creator><![CDATA[Guillaume Guy]]></dc:creator><pubDate>Tue, 15 Apr 2025 16:17:19 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1744733755552/2c3da3dc-61ab-4f0e-8111-276be66fdf04.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote>
<p>Branch prediction refers to a CPU's ability to predict the outcome of a boolean statement to streamline processing. This concept parallels human decision-making, where people continuously assess potential outcomes based on experience and inference, often leading to errors or mispredictions. Such errors can result from incorrect priors and can be challenging to correct, similar to the sunk cost fallacy in human behavior.</p>
</blockquote>
<h2 id="heading-what-is-branch-prediction">What is branch prediction?</h2>
<p>Branch prediction is a relatively unknown term to non-CS folks. It describes the ability of modern CPUs to predict the value of a boolean statement and speculatively proceed ahead while verifying that statement in parallel. It usually speeds up computation but hurts when the outcome is intrinsically difficult to predict (think coin flip).</p>
<p>There is a famous Stack Overflow <a target="_blank" href="https://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-processing-an-unsorted-array/11227902#11227902">answer</a> with 33k upvotes that covers an easy-to-understand use case for those who want to learn more.</p>
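<p>To make the idea concrete, here is a toy simulator of a two-bit saturating-counter predictor, one of the classic textbook schemes (an illustration only, not the predictor of any specific CPU):</p>

```python
def mispredictions(outcomes):
    """Count mispredictions of a 2-bit saturating counter.

    States 0-1 predict "not taken", states 2-3 predict "taken".
    """
    state = 2  # start weakly predicting "taken"
    misses = 0
    for taken in outcomes:
        if (state >= 2) != taken:
            misses += 1
        # saturating update toward the observed outcome
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return misses
```

<p>A branch taken 100 times in a row costs zero mispredictions, while a strictly alternating branch misses every other iteration — the effect behind the sorted-vs-unsorted array benchmark in the Stack Overflow answer.</p>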
<h2 id="heading-how-does-it-relate-to-humans">How does it relate to humans?</h2>
<p>As it turns out, humans, too, are constantly trying to predict things in day-to-day life. Is this person a threat? Can I park here without getting a ticket? When assessing these questions, one usually combs through prior experience, using some type of Bayesian inference to arrive at an estimate. For instance, guns increase the perceived risk whereas some locations (e.g. a playground) reduce it. Humans are always computing the odds in a background process, which only triggers an alarm when the computed risk crosses a certain threshold.</p>
<p>The problem is that humans make mispredictions too. Priors may be informed by past experience that is not representative or, worse, built upon beliefs that are plainly false. In such cases, one may misconstrue certain animals as dangerous and, worse, have no feedback loop that corrects this imperfect belief. As with CPUs, human branch mispredictions can therefore be costly. And we have a name for it: the sunk cost fallacy (<a target="_blank" href="https://en.wikipedia.org/wiki/Sunk_cost">wikipedia</a>). And yes, it’s hard to correct.</p>
]]></content:encoded></item><item><title><![CDATA[Should we ship this feature?]]></title><description><![CDATA[Introduction
During my time at tech companies, one of the most challenging tasks for my team has been guiding the Product team in their decision to launch a feature. This process is both highly consequential and fraught with potential pitfalls. Let's...]]></description><link>https://www.guillaume.nyc/should-we-ship-this-feature</link><guid isPermaLink="true">https://www.guillaume.nyc/should-we-ship-this-feature</guid><category><![CDATA[False discovery]]></category><category><![CDATA[experimentation]]></category><dc:creator><![CDATA[Guillaume Guy]]></dc:creator><pubDate>Sun, 01 Aug 2021 17:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1745421547789/4d47ef02-2862-4b2d-a0e7-a971ccceaa21.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-introduction">Introduction</h2>
<p>During my time at tech companies, one of the most challenging tasks for my team has been guiding the Product team in their decision to launch a feature. This process is both highly consequential and fraught with potential pitfalls. Let's consider an example to ease into the topic:</p>
<p><em>You've worked hard to convince leadership to measure the impact of a cool new feature, and you've set up a well-designed experiment. After four weeks, you observe a statistically significant improvement in your primary metric, while secondary metrics remain neutral. The story seems clear: you should tell the team to ship it, right?</em></p>
<h2 id="heading-the-problem-with-shipping-features">The problem with shipping features</h2>
<p>Once the results are in, the focus often shifts to outcomes, and leadership is incentivized to ship. Consequently, the cost side of the equation receives little attention:</p>
<ul>
<li><p><strong>Tech Debt</strong>: Does this feature increase system complexity? Does it hinder long-term growth by imposing a tax on every new feature? Consider the Net Present Value of such a tax. This aspect is frequently overlooked, leaving some teams so burdened that development cycles run 2-3 times longer than one could reasonably expect.</p>
</li>
<li><p><strong>Long-term Effect</strong>: Experiments measure short-term effects on a population. They do not capture behaviors that take time to materialize, such as competitors' reactions, brand effects, or customer word-of-mouth on "circumventing the feature" in cases of fraud.</p>
</li>
<li><p><strong>False Positives</strong>: A low p-value does not address false positives. Instead, it describes the probability of observing a change at least as extreme as the one you are seeing under the null hypothesis. What is the probability of a false positive given your observations?</p>
</li>
</ul>
<h2 id="heading-thinking-about-tech-debt">Thinking about Tech Debt</h2>
<p>Tech debt is primarily shouldered by Engineering and sometimes by Data Science. A strong partnership with Engineering is necessary to assess it accurately. Here are a few probing questions to consider: Is the system more complex? How so? Can you help me understand your development velocity before and after? Can you quantify it? For example, if you had a single model before and this feature adds another model, training time might increase by 100%, resulting in a ~20% longer development cycle. Is the system more fragile? Do you have more interfaces, special cases, or reliance on data pipelines/APIs that may become obsolete?</p>
<p><em>Fragility is an interesting topic that is easy to underestimate. If you add three components with a 1/1000 chance of failing in a day, the probability that your system will fail at least once over a year is 66% (yes, more likely to happen than not).</em></p>
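<p>The arithmetic behind that 66% is a one-liner, assuming the three components fail independently:</p>

```python
p_daily = 1 / 1000           # daily failure probability per component
components, days = 3, 365
# P(at least one failure in a year) = 1 - P(no component fails on any day)
p_year = 1 - (1 - p_daily) ** (components * days)
# ≈ 0.666, i.e. more likely to happen than not
```
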
<h2 id="heading-rethinking-leadership-incentive">Rethinking leadership incentive</h2>
<p>At the core of this issue is incentive: a team is seen as successful if it ships features, and so is a leader of an organization. Since tech debt is difficult to quantify and gradually impacts team productivity, team leadership often disregards it.</p>
<p>While Data Science cannot change the incentive structure alone, it can play a significant role in highlighting these trade-offs by leveraging its institutional status as a "guide to decision making." Most leaders are reasonable and strive to do right by their team. They understand these trade-offs as long as they are presented clearly and early. Tactically, individuals at my level (DS manager) can initiate discussions by requesting an analysis before launching the experiment (by including that component in their experiment spec) and reaching a consensus with PM/Eng on shared launch criteria.</p>
<h2 id="heading-quantifying-false-discovery-rate">Quantifying False Discovery Rate</h2>
<p>To address false positives, you need to dig deeper into the definitions of power and false discovery. You can write four equations with four unknown variables (TP, FP, TN, FN):</p>
<ul>
<li><p>Power: ability to detect a true effect, i.e. TP / (TP + FN)</p>
</li>
<li><p>p-value threshold: acceptance of False Positives (the higher, the more “accepting of FP” the experimenter is), i.e. FP / (FP + TN)</p>
</li>
<li><p>Ship rate: % of features shipped, i.e. TP + FP</p>
</li>
<li><p>TP + FP + TN + FN = 1</p>
</li>
</ul>
<p>Leading to this matrix:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1744738770136/ed497f4d-c427-4a23-8ddf-9511f19fe732.png" alt class="image--center mx-auto" /></p>
<p>Inverting the matrix solves the problem. In particular, we are interested in FP / (FP + TP), which gives you P(FP | shipped), that is, the False Discovery Rate:</p>
<p><img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh3wR0vb_mp1PSSM5COh_xB0vcN5m56xw7VADnnUiAG_N_5j_05jfiFlMQQw6Mt1xpBTxtaoBWimRb7e9FSRAqe4GEjYUTGeaAG8Kp5pfq5DGdWCtXR89GP_JhH4SpxASKFxSnMlSeeT-Ad/w640-h348/Screen+Shot+2021-08-27+at+12.06.56+PM.png" alt /></p>
<p>You can see that you should not be overly concerned about false discoveries unless you are conducting numerous experiments and "seeing what sticks" (a form of p-hacking). At a 50% ship rate, only 1.5% of all experiments shipped are false positives (for p=0.05), which seems reasonable. If your ship rate falls below this threshold, it may be time to recommend a lower p-value threshold.</p>
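<p>The system is small enough to solve in closed form rather than by matrix inversion. A minimal sketch, where power, p-value threshold, and ship rate are assumed inputs (80% power below is a common but arbitrary choice, so the numbers will shift with your assumptions):</p>

```python
def false_discovery_rate(power, alpha, ship_rate):
    """Solve the four equations for TP and FP, return FP / (FP + TP).

    Equations: TP = power * (TP + FN), FP = alpha * (FP + TN),
               TP + FP = ship_rate, TP + FP + TN + FN = 1.
    """
    a = (1 - alpha) / alpha    # from FP = alpha * (FP + TN): TN = a * FP
    b = (1 - power) / power    # from TP = power * (TP + FN): FN = b * TP
    # Substitute into TP + FP + TN + FN = 1 with FP = ship_rate - TP:
    tp = (1 - ship_rate * (1 + a)) / (b - a)
    fp = ship_rate - tp
    return fp / (fp + tp)
```

<p>With 80% power, p = 0.05, and a 50% ship rate, this sketch gives an FDR of 4%; the exact figure depends on the power you assume.</p>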
<p>Code: <a target="_blank" href="https://github.com/guillaumeguy/guillaume.nyc/blob/main/feature_ship/Quantifying%20False%20Positives.ipynb">here</a></p>
<p>Good luck!</p>
]]></content:encoded></item><item><title><![CDATA[My new role at Lyft (2018)]]></title><description><![CDATA[In October, I joined Lyft as Data Science Manager for Core Mapping. I wish I could have posted this update a while ago but a big event got in the way (yes, we went public) ...
While low visibility, Mapping turns out to a big deal for ride-sharing as ...]]></description><link>https://www.guillaume.nyc/my-new-role-at-lyft-2018</link><guid isPermaLink="true">https://www.guillaume.nyc/my-new-role-at-lyft-2018</guid><category><![CDATA[core mapping]]></category><category><![CDATA[lyft]]></category><dc:creator><![CDATA[Guillaume Guy]]></dc:creator><pubDate>Mon, 01 Oct 2018 17:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1745421511552/f82a5ee6-f311-4418-921f-d3b8772c5b21.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In October, I joined Lyft as Data Science Manager for Core Mapping. I wish I could have posted this update a while ago but a big event got in the way (yes, we went public) ...</p>
<p>While low-visibility, Mapping turns out to be a big deal for ride-sharing, as it influences a lot of other services. In a nutshell, Mapping has an impact on pricing, driver dispatch, scheduled rides, and the customer experience. It is also usually the biggest friction point for seamless pickups.</p>
<h3 id="heading-why-is-mapping-important-for-ride-sharing-company"><strong>Why is Mapping important for a ride-sharing company?</strong></h3>
<p>Mapping traditionally has four components: Basemap (a representation of the world as a graph), Locations (where are drivers and passengers?), Routing (optimal paths between locations), and ETA (distance and time between locations).</p>
<p>When you open your app, you see the following screen:</p>
<p><a target="_blank" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiXFr6t0pDBkH7pFpZ4pIoJ3Cr_h4w1OSiSVFimjmh_db9LOZ4gx75L5mHdh9IVfGVAkLJE1W-HTDj87lgsTpurgiGM-C3IRXioQP4lpY7lV_c6m0EfUGasXBvn0Eg4MNfbrLVGE1F97dkv/s1600/eta_is_important.png"><img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiXFr6t0pDBkH7pFpZ4pIoJ3Cr_h4w1OSiSVFimjmh_db9LOZ4gx75L5mHdh9IVfGVAkLJE1W-HTDj87lgsTpurgiGM-C3IRXioQP4lpY7lV_c6m0EfUGasXBvn0Eg4MNfbrLVGE1F97dkv/s320/eta_is_important.png" alt /></a></p>
<p>A lot of information displayed is connected to the work by my team:</p>
<ul>
<li><p>The surrounding physical world: road segments, (train) stations, POI</p>
</li>
<li><p>The pick-up ETA (3min) looks at drivers around the PIN and compares supply to demand. This gives an estimate of how fast we can get a car to you</p>
</li>
<li><p>The drop-off ETA (10:03) estimates when we can get you to your destination</p>
</li>
<li><p>The price ($36.66), which uses a lot of indicators, including drop-off ETAs, to make the ride fair for both parties: riders and drivers</p>
</li>
<li><p>The polyline (purple) showing the route to your destination</p>
</li>
</ul>
<p>Once you click on "Request", the application will dispatch a driver to you. This decision is central to the efficiency of the platform and therefore very sophisticated. Getting a driver to you ASAP (using our ETAs) is critical to delivering the best user experience. There are usually two ways of solving this: the greedy way (dispatch the closest driver) and the optimal way (solve a global minimization problem).</p>
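<p>The gap between the two approaches can be made concrete with a toy example (made-up ETA numbers; real dispatch systems use specialized solvers, not brute force):</p>

```python
from itertools import permutations

# Hypothetical pairwise ETAs (minutes): eta[d][r] = driver d -> rider r
eta = [
    [1, 2],    # driver 0
    [2, 100],  # driver 1
]

def greedy_dispatch(eta):
    """Repeatedly dispatch the globally closest (driver, rider) pair."""
    drivers, riders = set(range(len(eta))), set(range(len(eta[0])))
    total = 0
    while drivers and riders:
        d, r = min(((d, r) for d in drivers for r in riders),
                   key=lambda p: eta[p[0]][p[1]])
        total += eta[d][r]
        drivers.remove(d)
        riders.remove(r)
    return total

def optimal_dispatch(eta):
    """Minimize SUM(duration) over all assignments.

    Brute force over permutations; assumes a square matrix,
    fine for toy sizes only.
    """
    return min(sum(eta[d][r] for d, r in enumerate(perm))
               for perm in permutations(range(len(eta))))
```

<p>Here the greedy match grabs the 1-minute pair first and strands the last rider with a 100-minute pickup (total 101 minutes), while the optimal assignment accepts two 2-minute pickups (total 4 minutes).</p>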
<h3 id="heading-exciting-problems-in-core-mapping"><strong>Exciting problems in Core Mapping</strong></h3>
<p>What are some of the problems in Mapping that are interesting in nature? This is a broad question, so I will sample just a few:</p>
<p><strong>How does driver location influence ETA?</strong><br />We collect GPS signals from our drivers by streaming them to our services. There is a chance they can be wrong: for instance, snapping the driver to the wrong road segment often occurs in urban canyons.</p>
<p>In the example below, what if the driver is falsely matched to Bryan Street but is actually about to enter I-80? Dispatching this driver would arguably be disastrous: the driver, once on I-80, may have to drive all the way to Oakland and back! A small map-matching variation (in distance) has a non-linear relationship with ETA.</p>
<p><a target="_blank" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgU08C-Gsxv4ZaiEgkVHXttX8rDWfKNLg1ivuzo_f6rtIvYASAEVizbg9xzQQs9FYfLjCIGkOX8CW2i0G01e7qlHnocCUM5fp1Qn2YN1VCQqiUwfq2Qw2rMzFxMCYQ3SPK3ng2DbPIUVwD6/s1600/Road+segment.png"><img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgU08C-Gsxv4ZaiEgkVHXttX8rDWfKNLg1ivuzo_f6rtIvYASAEVizbg9xzQQs9FYfLjCIGkOX8CW2i0G01e7qlHnocCUM5fp1Qn2YN1VCQqiUwfq2Qw2rMzFxMCYQ3SPK3ng2DbPIUVwD6/s400/Road+segment.png" alt /></a></p>
<p><strong>How does ETA influence Dispatch?</strong><br />To provide the best user experience, dispatch uses pairwise (DVR, PAX) ETAs to make the best Supply &lt;==&gt; Demand match. In the ideal state, dispatch minimizes SUM(duration) across all dispatches. However, we only measure the dispatches that occurred and have little information about those that did not materialize. Gaining a better understanding of what leads to a "bad dispatch" (whether false positive or false negative) is a central topic for my team and for Lyft!</p>
<p>As Lyft speeds up into 2020, Mapping will become an even more important topic, e.g. for multimodal transportation. Reach out to me if you want to join the team!</p>
]]></content:encoded></item></channel></rss>