Advent of Code 2022 - Day 9 - part 1

30 Nov 2023

To get ready for Advent of Code 2023, I continued where I stopped last year: day 9.

Here’s me struggling for 1 hour and 40 minutes spread across two days, because I tried to be clever. For viewing pleasure it has been sped up.

What goes wrong here is that I assumed incorrectly that a negative number to the power of zero would be -1. Quick maths turned out not to be good maths. The following bit of experimentation led me to believe that.

result = -5 ** 0  
assert result == -1

You may have already guessed that the order of operation fooled me. The ** operator is executed before the - operator is applied to the result. The behaviour is different if I do the same using a variable:

x = -5
result = x ** 0
assert result == 1

The solution that eventually led to the correct answer is in the code block below. After submitting the correct answer, I did some cleanup:

deleted some unreachable code
linted it
removed some debug lines
and added some more comments
deleted the x ** 0 and y ** 0 since they are pointless now.

from advent_of_code import *
import requests_cache

requests_cache.install_cache()

test_input = """R 4
U 4
L 3
D 1
R 4
D 1
L 5
R 2"""
input_9 = fetch_input(9)
# input_9 = test_input
movements = [x.split(' ') for x in input_9.splitlines()]
movements = [(x[0], int(x[1])) for x in movements]
visited = set()
head = (0, 0)
tail = (0, 0)

directions = {
# What coordinates change for each movement
#         x, y
    "U": (1, 0),
    "D": (-1, 0),
    "L": (0, -1),
    "R": (0, 1),
}


def modify_location(location, direction):
    if isinstance(direction, str):
        change = directions[direction]
    else:
        change = direction
    return location[0] + change[0], location[1] + change[1]


def direction_to_move(head, tail):
    x = head[0] - tail[0]
    y = head[1] - tail[1]
    
    # head and tail are at the same location, don't move
    if x == 0 and y == 0:  
        return 0, 0
    
    # head and tail are less than one square apart (including diagonally)
    elif max(abs(x), abs(y)) == 1:  
        return 0, 0
    
    # head and tail are too far apart, decide which direction to move the tail
    else:
        if x < 0:
            x = x ** 0 * -1
        else:
            x = 0 if not x else 1
        if y < 0:
            y = -1
        else:
            y = 0 if not y else 1
        return x, y


tail_visited = set()
for direction, length in movements:
    for _ in range(length):
        head = modify_location(head, direction)
        tail = modify_location(tail, direction_to_move(head, tail))
        # keep track of where the tail has been:
        tail_visited.add(tail)

submit_answer(level=1, day=9, answer=len(tail_visited))

The lessons I learned: Don’t try to be clever and check your maths.

No more 429: Combining ratelimit and requests_cache

27 Oct 2023

requests_cache is nice. ratelimit is nice. But they don’t play nicely together yet: If a request is coming from the cache that requests_cache maintains, ratelimit doesn’t “know that” and will still slow your script down for no reason. That’s why I published the ratelimit_requests_cache module. It offers a similar rate limiter to the ratelimit module, but invocations only count towards the rate limit if the request could not be served from the cache.

The usage is the same as the normal ratelimit package. You decorate a method with the sleep_and_retry and a limiting decorator, in this case the limits_if_not_cached:

import requests
import requests_cache
from ratelimit import sleep_and_retry
from ratelimit_requests_cache import limits_if_not_cached


@sleep_and_retry
@limits_if_not_cached(calls=1, period=1)
def get_from_httpbin(i):
    return requests.get(f'https://httpbin.org/anything?i={i}')


# Enable requests caching
requests_cache.install_cache()

# Notice that only the first ten requests will be ratelimited to 1 request / second
# After that, it's a lot quicker since requests can be served from the cache
# and the ratelimiter does not engage
for i in range(100):
    get_from_httpbin(i % 10)
    print(i)

See it in action:

This rate limiter is ideal for when an API call is expensive, measured in either time or in money. HTTP requests have to be performed only once, and you can better avoid getting HTTP 429 status codes.

This rate limiter checks whether a request was served from the cache or not by checking the .from_cache attribute of the Response. That means that if you have a different caching mechanism, you could also set this .from_cache boolean attribute and use the decorator for other purposes just as easily.

To start using it, get it from PyPI:

python3 -m pip install ratelimit_requests_cache

Making mitmproxy more easily debuggable

04 Oct 2023

Mitmproxy is just what it says on the tin: a proxy that can act as a man-in-the-middle. By default it will re-sign HTTPS traffic with its own root CA. It can also modify other requests in-place, using Python hooks.

In this post I show how I add a main to mitmproxy hook scripts themselves. This way both your hook and the mitmproxy invocation are contained within one file. I think a Python module without an if __name__ == '__main__' always is a bit of a missed opportunity. Even if a module is nested deep into your application, it might still be a suitable place to write some examples how to use the code in the module.

Normally, when you run a mitmproxy and want to set some hooks, you supply the script as a command line argument to the CLI tool.

mitmdump -q -s intercept.py
# or for the Command Line UI:
mitmproxy -s intercept.py

But, when running mitmproxy from the command line, you will not actually the __main__ of the mitmproxy module. The place where the CLI tool actually lives in a code base can usually be found in setup.py or pyproject.toml. There is often a parameter or section called scripts, console_scripts, or something similar, depending on the packaging tools. For mitmproxy, it was in pyproject.toml in the project.scripts section:

[project.scripts]
mitmproxy = "mitmproxy.tools.main:mitmproxy"
mitmdump = "mitmproxy.tools.main:mitmdump"
mitmweb = "mitmproxy.tools.main:mitmweb"

In the code below I import the method that contains the CLI tool. I also use the special __file__ variable present in Python. This contains the full filename of the script it’s called from.

from mitmproxy import http
from mitmproxy.tools.main import mitmdump


def websocket_message(flow: http.HTTPFlow):
    """ Hook on websocket messages """
    last_message = flow.websocket.messages[-1]
    print(last_message.content)


if __name__ == "__main__":
    mitmdump(
        [
            "-q",      # quiet flag, only script's output
            "-s",      # script flag
            __file__,  # use the same file as the hook
        ]
    )

This way of adding a main is a bit similar to what I did earlier with streamlit. That solution turned out to have some unforeseen implications: Streamlit was convinced a form was nested in itself. So, stay tuned for the trouble that this mitmproxy hack might cause later.

OpenAI's CLIP inference in C# using ONNX Runtime

29 Jul 2023

CLIP is a model developed by OpenAI (back in 2021), that can create embeddings for both text and images. These embeddings exist in the same vector space and can be compared across the two modalities. Contrary to some other OpenAI models, the weights are freely available.

The official implementation released by OpenAI is in Python. I needed to calculate CLIP vectors in C# however. To make the C# implementation, I build from the work of josephrocca, who ported the model from Torch to ONNX. Although he initially ported it to be able to use it in Javascript, we can reuse these weights in C#. The nice thing about the ONNX Runtime is that it is available for many programming languages, and the models and weights are compatible.

I’m planning to create a proper library here: clip.dll. In the future there will also be support for vectorizing text in that library. However, for the time being, the implementation is just the following snippet:

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Net;
using System.Text.Json;
using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;
using SixLabors.ImageSharp;
using SixLabors.ImageSharp.Processing;
using SixLabors.ImageSharp.PixelFormats;


class CLIP {
    static void Main(string[] args) {
        // Download the model weights if we don't have them in the current directory
        if (!File.Exists("clip-image-vit-32-float32.onnx"))
        {
            WebClient webClient = new WebClient();
            webClient.DownloadFile(
                "https://huggingface.co/rocca/openai-clip-js/resolve/main/clip-image-vit-32-float32.onnx",
                @"clip-image-vit-32-float32.onnx"
            );
        }

        // Load the model
        // Model sourced from: https://huggingface.co/rocca/openai-clip-js/tree/main
        var clipModel = new InferenceSession("clip-image-vit-32-float32.onnx");

        // Load an image specified as a command line argument
        var image = Image.Load<Rgba32>(File.ReadAllBytes(args[0]));

        // Calculate the shortest side, and use that to extract a square from the center
        // Known in other image libraries as Centercrop
        // AFAIK Centercrop is not available in Sixlabors.ImageSharp, so we do it manually
        var smallestSide = Math.Min(image.Width, image.Height);
        image.Mutate(x => x.Crop(
            new Rectangle(
                (image.Width - smallestSide) / 2,
            (image.Height - smallestSide) / 2,
            smallestSide,
            smallestSide
        )));

        // Resize to 224 x 224 (bicubic resizing is the default)
        image.Mutate(x => x.Resize(224, 224));

        // Create a new array for 1 picture, 3 channels (RGB) and 224 pixels height and width
        var inputTensor = new DenseTensor<float>(new[] {1, 3, 224, 224});

        // Put all the pixels in the input tensor
        for (var x = 0; x < 224; x++)
        {
            for (var y = 0; y < 224; y++)
            {
                // Normalize from bytes (0-255) to floats (constants borrowed from CLIP repository)
                inputTensor[0, 0, y, x] = Convert.ToSingle((((float) image[x, y].R / 255) - 0.48145466) / 0.26862954);
                inputTensor[0, 1, y, x] = Convert.ToSingle((((float) image[x, y].G / 255) - 0.4578275 ) / 0.26130258);
                inputTensor[0, 2, y, x] = Convert.ToSingle((((float) image[x, y].B / 255) - 0.40821073) / 0.27577711);
            }
        }

        // Prepare the inputs as a named ONNX variable, name should be "input"
        var inputs = new List<NamedOnnxValue> {NamedOnnxValue.CreateFromTensor("input", inputTensor)};

        // Run the model, and get the output back as an Array of floats
        var outputData = clipModel.Run(inputs).ToList().Last().AsTensor<float>().ToArray();

        // Write the array serialized as JSON
        Console.WriteLine(JsonSerializer.Serialize(outputData));
    }
}

Updated 2024-03-29: Added cropping a square out of the center to match the reference implementation

Why pip and Homebrew make a dangerous cocktail

20 Jul 2023

`pip install` will happily replace anything in `/usr/local/`

pip, a Python package manager, can install two types of Python packages: A source distribution and a binary distribution.

With the source distribution, it’s simply running the setup.py which often contains a call to setuptools.setup. You could argue that pip installing source distributions is RCE by design.

Binary distributions (often wheels) are not intended to run code immediately during install. They simply copy files using all kinds of logic defined in pip itself. This should make them less dangerous.

Homebrew is “The Missing Package Manager for macOS”, and probably the most popular way for MacOS users to get Python: Python 3.* was brew installed around 500,000 times in the last 30 days. “Homebrew installs packages to their own directory and then symlinks their files into /opt/homebrew (on Apple Silicon).” Note that this is /usr/local/ on Intel Macs. “Homebrew won’t install files outside its prefix and you can place a Homebrew installation wherever you like.”

In this post I’ll assume a Python 3.9 installation, performed with brew install [email protected]. I’ll demonstrate that a malicious Python package can replace files in the Homebrew prefix directory, by default /usr/local for Intel Macs and /opt/homebrew/ for ARM Macs.

By defining the following setup.py we could even replace the python3.9 executable itself. I replace python3.9 with an executable that simply outputs Not Python to demonstrate the issue:

from setuptools import setup

setup(name='malware',
      version='3.2.1',
      description='malware',
      url='https://example.com',
      author='',
      author_email='[email protected]',
      # Every file in /usr/local can be poisoned by including data_files.
      # If they already existed, the executable flag is preserved
      # This is just one example of a file that can be replaced:
      data_files=[("Cellar/[email protected]/3.9.17_1/bin", ["python3.9"])],
      packages=[],
      install_requires=[])

In a video this looks like this:

If a data file has the same path as an existing file, and the existing file has executable bits set, they will remain set!

Furthermore, real world attacks will be much more subtle than the one above. An attacker could patch some malware into a dynamic library while preserving its original functionality. Packages like lief will help you do that with ease.

Of course, before posting this here, I tried to find out how well known this issue is. After some discussion with the people running the security mailing lists at Python and the Python Packaging Authority, the conclusion is: There is no fix for this, at least not without breaking someone’s legitimate uses.

It is possible to think of some countermeasures to reduce the risk, however. One countermeasure could be screening packages for executables and libraries in places they don’t typically belong. This can be done in three steps:

Download the package and its dependencies using pip download, without installing them. The --only-binary=:all: is important because pip download will run a source distribution to find out its dependencies.

mkdir /tmp/scan
python3 -m pip download --only-binary=:all: -d /tmp/scan package_name

Apply the following script to detect any executables or libraries in places they don’t belong (using libmagic):

import glob
import zipfile

import magic  # brew install libmagic && python3.9 -m pip install python-magic
from tqdm import tqdm  # python3.9 -m pip install tqdm

for wheel in tqdm(list(glob.glob("/tmp/scan/*.whl"))):
    wheel = zipfile.ZipFile(wheel)
    for file in wheel.filelist:
        # this is an indication that data was added using data_files:
        if '.data/data/' in file.filename:
            # let libmagic find out what it is:
            magic_guess = magic.from_buffer(wheel.open(file.filename).read())
            # change this when not on MacOS to something relevant to your platform:
            if 'Mach-O' in magic_guess:  
                print(wheel.filename, file.filename, magic_guess)

Note that in this example I only check for Mach-O libraries and executables. This reduces false positives. For example: If a package author sets include_package_data=True there will be lots of Python files in the data directory as well. This scanning also does not check for shell scripts or other types of executables. To cast a wider net, check for the words 'executable' or 'library' in the output of libmagic.

Inspect the output of step 2, and decide if you still want to run pip install on the same target and its transitive dependencies.

Of course, it’s best to combine this scan-before-you-install with some long-standing best practices:

Firstly, avoid installing pip source distributions when possible, using the --only-binary=:all: parameter.

Secondly, follow the recommendations here, and configure Homebrew to use a directory not on the PATH. This means disregarding Homebrew’s own documentation, warning that it might be inconvenient not to have this set to /usr/local.

To summarize: installing things always introduces an inherent risk, but the way Homebrew’s Python is configured might pose an additional risk.

Older Newer