Task-agnostic feature upsampling has emerged as a way to obtain
dense visual features from pre-trained backbones without paying
the full quadratic cost of high-resolution self-attention.
Instead of increasing the token count in the backbone, these
methods map low-resolution features to high-resolution ones
using separate, smaller networks.
We introduce UPLiFT (Universal Pixel-dense
Lightweight Feature Transforms), an iterative upsampler that
revisits early convolutional approaches, showing that they can match
or surpass recent cross-attention-based methods at lower
inference cost. UPLiFT is built around a
Local Attender, a fully local attentional
pooling operator that aggregates features over a fixed
neighborhood using learned weights, avoiding global
query–key–value attention while still preserving the backbone’s
feature distribution.
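To make the operator concrete, the following is a minimal sketch of one local attentional pooling step in PyTorch. It is an illustrative assumption, not the paper's implementation: the class name LocalAttenderSketch, the 3x3 neighborhood, the 2x upsampling factor, the small convolutional weight predictor, and the pixel-shuffle rearrangement are all placeholders chosen for clarity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LocalAttenderSketch(nn.Module):
    """Local attentional pooling over a fixed k x k neighborhood.

    Each high-resolution output pixel is a softmax-weighted (convex)
    combination of the low-resolution features in its neighborhood, so
    outputs remain inside the backbone's feature distribution.
    """

    def __init__(self, channels: int, kernel_size: int = 3, scale: int = 2):
        super().__init__()
        self.k = kernel_size
        self.scale = scale
        # Hypothetical weight predictor: one attention logit per neighborhood
        # position, for each of the scale**2 sub-pixel output offsets.
        self.weight_head = nn.Conv2d(
            channels, scale * scale * kernel_size * kernel_size, 3, padding=1
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, H, W) low-resolution backbone features.
        b, c, h, w = feats.shape
        s, k = self.scale, self.k

        # Attention weights over each fixed k x k neighborhood (no global
        # query-key-value attention is computed).
        logits = self.weight_head(feats).view(b, s * s, k * k, h, w)
        attn = logits.softmax(dim=2)

        # Gather the k x k neighborhood of every low-resolution location.
        patches = F.unfold(feats, k, padding=k // 2).view(b, c, k * k, h, w)

        # Weighted sum over the neighborhood for every sub-pixel offset,
        # then rearrange the offsets into a higher-resolution grid.
        out = torch.einsum("bckhw,bskhw->bschw", patches, attn)
        out = out.permute(0, 2, 1, 3, 4).reshape(b, c * s * s, h, w)
        return F.pixel_shuffle(out, s)  # (B, C, H*scale, W*scale)


# Usage: upsample DINOv2-sized patch features by 2x; applying the operator
# iteratively would upsample further, with cost linear in output pixels.
up = LocalAttenderSketch(channels=768)
hi_res = up(torch.randn(1, 768, 32, 32))  # -> (1, 768, 64, 64)
```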
UPLiFT produces semantically stable, pixel-dense features from
ViT backbones such as DINOv2, achieving state-of-the-art
performance on semantic segmentation and monocular depth estimation while
maintaining linear time and memory scaling. It also extends
naturally to generative tasks by upsampling VAE latents, where
it attains image generation and super-resolution quality
competitive with Coupled Flow Matching models, despite using
fewer parameters, less training data, and fewer upsampling
steps.