OpenAI's CLIP inference in C# using ONNX Runtime

CLIP is a model developed by OpenAI (back in 2021), that can create embeddings for both text and images. These embeddings exist in the same vector space and can be compared across the two modalities. Contrary to some other OpenAI models, the weights are freely available.

The official implementation released by OpenAI is in Python. I needed to calculate CLIP vectors in C# however. To make the C# implementation, I build from the work of josephrocca, who ported the model from Torch to ONNX. Although he initially ported it to be able to use it in Javascript, we can reuse these weights in C#. The nice thing about the ONNX Runtime is that it is available for many programming languages, and the models and weights are compatible.

I’m planning to create a proper library here: clip.dll. In the future there will also be support for vectorizing text in that library. However, for the time being, the implementation is just the following snippet:

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Net;
using System.Text.Json;
using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;
using SixLabors.ImageSharp;
using SixLabors.ImageSharp.Processing;
using SixLabors.ImageSharp.PixelFormats;

class CLIP {
    static void Main(string[] args) {
        // Download the model weights if we don't have them in the current directory
        if (!File.Exists("clip-image-vit-32-float32.onnx"))
            WebClient webClient = new WebClient();

        // Load the model
        // Model sourced from:
        var clipModel = new InferenceSession("clip-image-vit-32-float32.onnx");

        // Load an image specified as a command line argument
        var image = Image.Load<Rgba32>(File.ReadAllBytes(args[0]));

        // Resize to 224 x 224 (bicubic resizing is the default)
        image.Mutate(x => x.Resize(224, 224));

        // Create a new array for 1 picture, 3 channels (RGB) and 224 pixels height and width
        var inputTensor = new DenseTensor<float>(new[] {1, 3, 224, 224});

        // Put all the pixels in the input tensor
        for (var x = 0; x < 224; x++)
            for (var y = 0; y < 224; y++)
                // Normalize from bytes (0-255) to floats (constants borrowed from CLIP repository)
                inputTensor[0, 0, y, x] = Convert.ToSingle((((float) image[x, y].R / 255) - 0.48145466) / 0.26862954);
                inputTensor[0, 1, y, x] = Convert.ToSingle((((float) image[x, y].G / 255) - 0.4578275 ) / 0.26130258);
                inputTensor[0, 2, y, x] = Convert.ToSingle((((float) image[x, y].B / 255) - 0.40821073) / 0.27577711);

        // Prepare the inputs as a named ONNX variable, name should be "input"
        var inputs = new List<NamedOnnxValue> {NamedOnnxValue.CreateFromTensor("input", inputTensor)};

        // Run the model, and get the output back as an Array of floats
        var outputData = clipModel.Run(inputs).ToList().Last().AsTensor<float>().ToArray();

        // Write the array serialized as JSON