Container Registry proxy
Continuing the effort to make commonly used software and software packages available off-grid, I put together a caching proxy application based on nginx to cache Docker images that have been pulled at least once.
The lack of general (OCI) container registry caching solutions was worrisome. Most of them, if not all, can mirror packages only from the primary Docker Hub, registry-1.docker.io, or are full-blown, complex artifact repositories… in Java. I asked myself: what can be so complex in a "docker pull" that prevents a simple general-purpose HTTP forward proxy from caching blobs and serving Docker images to LAN clients that were previously downloaded by another node on the LAN, i.e. a pull-through proxy?
Naturally I went to nginx, specifically to the proxy_store directive, which provides the core functionality. It could have been proxy_cache, which is in many aspects more powerful and finer grained: it respects cache expiry and validation, and it can store responses to non-idempotent requests (POST) too – which I did abuse earlier to save costs on a limited API that served effectively idempotent resources via POST requests. But I prefer simplicity, and the resources constituting OCI images are addressed by their hash, so don't expect many cache items to be refreshed anyway, except image tags.
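To make the shape of this concrete, here is a minimal sketch of the proxy_store approach, assuming a layout like the one described further below; the hostnames, paths and resolver are illustrative, not taken from cr-proxy:

```nginx
# Minimal sketch of the proxy_store idea; every name here is
# illustrative, not the actual cr-proxy configuration.
server {
    listen 80;
    root /var/lib/cr-proxy;
    resolver 9.9.9.9;               # required: proxy_pass below uses a variable

    location / {
        # Serve from the local store first; on a miss, go upstream.
        try_files /content/$host$uri @upstream;
    }

    location @upstream {
        proxy_pass https://$host;   # the original request URI is passed along
        proxy_ssl_server_name on;   # send SNI in the upstream TLS handshake
        # Persist the response body as a plain file for later clients.
        proxy_store        $document_root/content/$host$uri;
        proxy_store_access user:rw group:r all:r;
        proxy_temp_path    /var/lib/cr-proxy/temp;
    }
}
```

proxy_store simply mirrors whatever passes through onto the filesystem; there is no expiry and no validation, which is exactly the trade-off accepted above.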
Inspecting the HTTP communication during a docker pull revealed the most annoying part, the one that makes this off-grid cache harder to design: it is part of the pull mechanism to obtain one or more access tokens for a pull session at the start, i.e. on first contact with the registry. This creates a dilemma. If I let the client through to get a real token from the upstream, it takes long (I understand the registry provider wants to allocate resources so as to serve everyone more or less equally, that's why they hand out tokens; but why delay as early as the token exchange? Wouldn't it be just as effective to delay overconsuming clients at the first "data" access?). On the other hand, if I hand out a dummy token just to make docker pull clients happy, the pull will fail at the first cache MISS, as the request won't carry a real token to be authorized against the upstream. One idea was to run a background process that always keeps a valid token around, but first, that is wasteful as long as I don't have a ton of containers spinning up all day, and secondly, these tokens are scoped to repositories (e.g. library/busybox), of which there are many. So I stayed optimistic: I set a sane proxy_connect_timeout towards the upstreams and serve a dummy access token on error, hoping that everything the client will ask for is already cached.
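Continuing the sketch above, the optimistic fallback can be expressed with a short connect timeout and an error_page; the endpoint name and the timeout value are my illustration:

```nginx
# Sketch of the optimistic token handling (endpoint and timeout are
# illustrative). Assumes the resolver from the sketch above.
location = /token {
    set $auth_upstream auth.docker.io;  # variable defers DNS to request time
    proxy_pass https://$auth_upstream;
    proxy_ssl_server_name on;
    proxy_connect_timeout 3s;           # fail fast when off-grid
    proxy_intercept_errors on;          # treat upstream 5xx as errors too
    # On failure, answer 200 with a canned token so the pull can
    # proceed, hopefully entirely from the store.
    error_page 502 504 =200 /cr-proxy/dummy-token.json;
}

location /cr-proxy/ {
    # Files served by the proxy itself, e.g. dummy-token.json.
}
```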
Caching redirection responses (HTTP 30x) was another point that would have suggested proxy_cache, had I anticipated it. But the discovery that the ngx_http_perl module is part of the default distribution swept me away. It was not in vain: it turned out I must not serve a redirection response from cache blindly, because the resource it points to (i.e. the Location: URL) might not be in the cache; besides, these redirections all go (so far) to a storage provider with a short-term access token in the URL, so that token will definitely have expired by the time we serve the cached version. Thus $stored_redirect (as I named the variable indicating that we got an HTTP 30x on the first request to a particular URL) comes from cache only if we also have the file it redirects to.
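A sketch of that check with ngx_http_perl follows; it assumes the stored redirect file holds the status code on the first line and the Location URL on the second, which is my illustrative format, not necessarily cr-proxy's:

```nginx
# Illustrative sketch: hand out a stored 30x only when the payload
# it points to is also in the store.
perl_set $stored_redirect 'sub {
    my $r    = shift;
    my $root = "/var/lib/cr-proxy";
    my $file = "$root/redirect/" . $r->variable("host") . $r->uri;
    return "" unless -f $file;

    open(my $fh, "<", $file) or return "";
    chomp(my $status   = <$fh>);    # e.g. 307
    chomp(my $location = <$fh>);    # the recorded Location: URL
    close($fh);

    # The Location points at blob storage with a short-lived token in
    # the query string; strip the query, then require the target
    # payload to exist under content/<DOMAIN>/<PATH>.
    my ($target) = $location =~ m{^https?://([^?]+)};
    return "" unless $target && -f "$root/content/$target";
    return "$status $location";
}';
```

An empty $stored_redirect then means the request has to go upstream even though a redirect was recorded for it.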
Preserving the Content-Type was also a critical part of making the whole proxy mechanism work for Docker clients. They are sensitive to the Content-Type – to both of the two I encountered, that is: the JSON payloads can be of image index or manifest type. (A sketch of replaying the stored header follows the directory listing below.)
This brought me to define the following directory structure under the document_root:

- content/<DOMAIN>/<PATH>
  - the actual HTTP payloads
- content-type/<DOMAIN>/<PATH>
  - Content-Type headers per URL
  - worth putting the URLs into content-type/<MAIN>/<SUB> files line by line to save fs inodes
- redirect/<DOMAIN>/<PATH>
  - HTTP status codes (301, 302, 307, ...) and Location headers
- www-authenticate/<DOMAIN>/<PATH>
  - WWW-Authenticate headers
- docker-content-digest/<DOMAIN>/<PATH>
  - Docker-Content-Digest headers
  - not strictly required, but saves requests when pulling images by tag (latest, v1.1, ...)
- cr-proxy
  - files served by the proxy alone, e.g. dummy-token.json
- temp
  - proxy_store's in-progress files
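A static try_files, as in the first sketch, would guess the MIME type from the file extension, so the stored header has to be replayed explicitly. A hedged sketch with an inline ngx_http_perl handler (paths illustrative, not the actual cr-proxy code):

```nginx
location / {
    perl 'sub {
        my $r    = shift;
        my $root = "/var/lib/cr-proxy";
        my $key  = $r->variable("host") . $r->uri;
        # A real config would fall back to the upstream on a miss
        # instead of answering 404.
        return HTTP_NOT_FOUND unless -f "$root/content/$key";

        # Replay the Content-Type recorded at store time; docker
        # tells an image index from a manifest by this header alone.
        my $ct = "application/octet-stream";
        if (open(my $fh, "<", "$root/content-type/$key")) {
            chomp($ct = <$fh>);
            close($fh);
        }
        $r->send_http_header($ct);
        $r->sendfile("$root/content/$key") unless $r->header_only;
        return OK;
    }';
}
```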
There are other points, too, that one needs to consider when configuring systems like this: namely HTTPS. Docker forcefully wants to communicate over HTTPS, even when you set the https_proxy environment variable, which should suggest that you are in control of the communication at least up to the proxy you yourself set. So I would expect the client software (docker in this case) to respect your choice of not wanting the hassle of MITM-ing yourself in order to strip the SSL/TLS layer to be able to inspect the traffic that the software you run generates for you. But no: you either list all the registries on earth in the insecure-registries option of daemon.json, or you strip TLS right before the cr-proxy. You need to touch the client machines anyway to set the https_proxy environment for the docker daemon, so you are already there to install your mitmproxy's CA cert. Unless, of course, you are nerd enough to turn the whole story into a transparent proxy; but in that case you have to roll out your CA certs somehow, too.
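For the record, pointing the daemon at the proxy and trusting the MITM CA is only a few lines on a Debian-flavoured client; the proxy address and file names are my illustration:

```sh
# Sketch for a client machine (addresses and paths illustrative).
mkdir -p /etc/systemd/system/docker.service.d
cat > /etc/systemd/system/docker.service.d/http-proxy.conf <<'EOF'
[Service]
Environment="HTTPS_PROXY=http://cr-proxy.lan:3128"
Environment="NO_PROXY=localhost,127.0.0.1"
EOF

# Trust the TLS-stripping proxy's CA (Debian-style):
cp mitmproxy-ca-cert.pem /usr/local/share/ca-certificates/mitmproxy.crt
update-ca-certificates

systemctl daemon-reload && systemctl restart docker
```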
Give it a glance if you need a general container registry proxy that mirrors images from any upstream Docker registry, not just a single dedicated one.
https://codeberg.org/hband/cr-proxy