UPLiFT: Efficient Pixel-Dense Feature Upsampling with Local Attenders

University of Maryland, College Park · Meta

UPLiFT is a lightweight, iterative feature upsampler that converts coarse ViT and VAE features into pixel-dense representations using a fully local attention operator. It preserves semantics while scaling linearly in the number of tokens, enabling efficient dense prediction and high-resolution generation.

Abstract

Task-agnostic feature upsampling has emerged as a way to obtain dense visual features from pre-trained backbones without paying the full quadratic cost of high-resolution self-attention. Instead of increasing the token count in the backbone, these methods map low-resolution features to high-resolution ones using separate, smaller networks.

We introduce UPLiFT (Universal Pixel-dense Lightweight Feature Transforms), an iterative upsampler that revisits early convolutional approaches and shows they can match or surpass recent cross-attention-based methods at lower inference cost. UPLiFT is built around a Local Attender, a fully local attentional pooling operator that aggregates features over a fixed neighborhood using learned weights, avoiding global query–key–value attention while still preserving the backbone’s feature distribution.

UPLiFT produces semantically stable, pixel-dense features from ViT backbones such as DINOv2, achieving state-of-the-art performance on semantic segmentation and monocular depth while maintaining linear time and memory scaling. It also extends naturally to generative tasks by upsampling VAE latents, where it attains competitive image generation and super-resolution quality compared to Coupled Flow Matching (CFM) models, despite using fewer parameters, less training data, and fewer upsampling steps.

UPLiFT in Predictive and Generative Tasks

UPLiFT is designed as a task-agnostic feature upsampler and can be plugged into both discriminative and generative pipelines without modifying the underlying backbone or generator.

  • Predictive tasks: UPLiFT upsamples DINOv2 features for semantic segmentation and monocular depth estimation, improving mIoU and depth accuracy over prior feature upsamplers while running faster than recent cross-attention-based methods.
  • Generative tasks: Applied to Stable Diffusion VAEs, UPLiFT upsamples latent codes for efficient text-to-image upscaling and 4× image super-resolution, reaching quality comparable to CFM with significantly lower compute and fewer parameters.

Method Overview

UPLiFT follows an iterative upsampling design: a single compact decoder is applied multiple times to grow coarse feature maps to pixel density, guided by shallow, high-resolution encoder features from the input image. At each step, a Local Attender enforces consistency with the original backbone features while only accessing a small neighborhood around each token.

Key Components

  • UPLiFT Encoder ($E_{\text{UPLiFT}}$): A shallow convolutional encoder that processes the input image once and outputs dense, high-resolution guide features. These features are nearest-neighbor downsampled to match the resolution of each upsampling step, so the image never has to be re-encoded at intermediate resolutions.
  • UPLiFT Decoder ($D_{\text{UPLiFT}}$): A lightweight convolutional decoder trained to perform 2× upsampling. The same module is reused across steps to grow low-resolution backbone features to pixel-dense maps.
  • Local Attender: A local attention operator that uses the guide features to predict attention weights over a fixed offset neighborhood around each low-resolution token, and then linearly recombines value features from the backbone. This preserves the backbone’s feature distribution while avoiding global attention.
  • Multi-step training: UPLiFT is trained with a feature reconstruction loss at multiple resolutions, encouraging stability across all intermediate upsampling stages; a sketch of how these components fit together in one upsampling loop follows this list.
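To make the loop structure concrete, the PyTorch sketch below shows one plausible way to wire these components together. The module definitions, channel widths, and the function name uplift_upsample are illustrative assumptions rather than the released implementation, and the sketch omits the Local Attender step (covered in the next section) to keep the iterative structure visible.

# Illustrative sketch of the iterative UPLiFT pipeline; names, shapes, and
# hyperparameters are assumptions, not the paper's released code.
import torch
import torch.nn.functional as F
from torch import nn

class UpliftEncoder(nn.Module):
    """Shallow conv encoder: run once on the image, outputs dense guide features."""
    def __init__(self, guide_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.GELU(),
            nn.Conv2d(32, guide_dim, 3, padding=1),
        )

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        return self.net(image)  # (B, guide_dim, H, W) at pixel resolution

class UpliftDecoder(nn.Module):
    """Lightweight conv decoder for a single 2x step; the same module is reused at every step."""
    def __init__(self, feat_dim: int, guide_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(feat_dim + guide_dim, feat_dim, 3, padding=1), nn.GELU(),
            nn.Conv2d(feat_dim, feat_dim, 3, padding=1),
        )

    def forward(self, feats: torch.Tensor, guide: torch.Tensor) -> torch.Tensor:
        feats = F.interpolate(feats, scale_factor=2, mode="nearest")
        return self.net(torch.cat([feats, guide], dim=1))

def uplift_upsample(image, backbone_feats, encoder, decoder, num_steps):
    """Apply the shared decoder num_steps times, doubling resolution each time.
    The pixel-dense guide features are computed once and nearest-neighbor
    downsampled to match each intermediate resolution."""
    guide_full = encoder(image)
    feats = backbone_feats
    outputs = []
    for _ in range(num_steps):
        target = (feats.shape[-2] * 2, feats.shape[-1] * 2)
        guide = F.interpolate(guide_full, size=target, mode="nearest")
        feats = decoder(feats, guide)
        outputs.append(feats)  # kept for the multi-resolution reconstruction loss
    return outputs

Under the multi-step training scheme described above, each entry of outputs would be supervised at its own resolution by the feature reconstruction loss, which is what encourages stability across all intermediate stages.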

Local Attender Operator

Inputs: Guide feature map $G \in \mathbb{R}^{H_g \times W_g \times C_G}$, value feature map $V \in \mathbb{R}^{H_v \times W_v \times C_V}$, neighborhood offsets $\mathcal{N}$

Output: Upsampled feature map $Y$ aligned with $G$

1: Project $G$ with a $1 \times 1$ convolution to logits $A \in \mathbb{R}^{H_g \times W_g \times |\mathcal{N}|}$
2: Apply softmax over the neighborhood dimension to obtain attention weights $\alpha$
3: For each spatial position $(x, y)$ in $G$, locate its corresponding coarse position $(x', y')$ in $V$ and gather the local value features $\{V_{x'+i,\, y'+j} : (i,j) \in \mathcal{N}\}$ from $V$ (with padding at the borders)
4: Compute $Y_{x,y} = \sum_{k=1}^{|\mathcal{N}|} \alpha_{x,y,k} \, V_{x'+i_k,\, y'+j_k}$
5: Return $Y$ as the locally attended, upsampled value feature map

Because the neighborhood size $|\mathcal{N}|$ is fixed, the computational and memory cost of the Local Attender scales as $\mathcal{O}(|\mathcal{N}| \cdot T)$, where $T$ is the number of spatial tokens, yielding linear scaling in the number of patches.
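The operator above maps directly onto a handful of tensor operations. The following PyTorch sketch assumes a square, odd-sized $k \times k$ neighborhood, zero padding at the borders, and a nearest-neighbor correspondence between guide positions and coarse tokens; the class name LocalAttender and these specific choices are assumptions made for illustration, not the paper's exact implementation.

# Illustrative Local Attender: attention weights predicted from the guide map,
# values gathered from a fixed k x k neighborhood of the coarse feature map.
import torch
import torch.nn.functional as F
from torch import nn

class LocalAttender(nn.Module):
    def __init__(self, guide_dim: int, kernel_size: int = 3):
        super().__init__()
        self.k = kernel_size
        # Step 1: a 1x1 conv projects guide features to |N| = k*k attention logits.
        self.to_logits = nn.Conv2d(guide_dim, kernel_size * kernel_size, 1)

    def forward(self, guide: torch.Tensor, value: torch.Tensor) -> torch.Tensor:
        B, Cv, Hv, Wv = value.shape
        _, _, Hg, Wg = guide.shape
        k = self.k

        # Steps 1-2: per-position attention weights over the k*k neighborhood.
        attn = self.to_logits(guide).softmax(dim=1)               # (B, k*k, Hg, Wg)

        # Step 3: gather k*k neighborhoods of V (zero-padded at the border), then
        # replicate each coarse neighborhood to the guide resolution so every
        # guide position sees the neighborhood of its nearest coarse token.
        patches = F.unfold(value, kernel_size=k, padding=k // 2)  # (B, Cv*k*k, Hv*Wv)
        patches = patches.reshape(B, Cv * k * k, Hv, Wv)
        patches = F.interpolate(patches, size=(Hg, Wg), mode="nearest")
        patches = patches.reshape(B, Cv, k * k, Hg, Wg)

        # Step 4: Y_{x,y} = sum_k alpha_{x,y,k} * V_k (weighted recombination of values).
        return (attn.unsqueeze(1) * patches).sum(dim=2)           # (B, Cv, Hg, Wg)

Because kernel_size is fixed, each output position touches only $k^2$ value vectors, matching the $\mathcal{O}(|\mathcal{N}| \cdot T)$ cost stated above.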

Results

Linear scaling: UPLiFT’s inference time grows roughly linearly with visual token count, while recent cross-attention-based upsamplers exhibit quadratic scaling and hit memory limits at lower resolutions.

Predictive Tasks: Segmentation and Depth

Using DINOv2-S/14 as the backbone and training only linear probes on top of upsampled features, UPLiFT achieves higher semantic segmentation performance than prior feature upsamplers across COCO-Stuff, VOC, ADE20K, and Cityscapes, while maintaining lower latency than recent cross-attention-based alternatives.

For monocular depth on COCO-Stuff, UPLiFT attains competitive thresholded accuracy and the lowest or near-lowest RMSE among all methods, indicating that local upsampling with the Local Attender still captures the global structure required for depth reasoning.

Generative Tasks: Text-to-Image Upscaling and Super-Resolution

High-resolution generation: UPLiFT upsampling of Stable Diffusion latents yields 2048×2048 images with visual quality comparable to CFM, but with fewer parameters, less training data, and fewer upsampling steps.

On COCO and reLAION benchmarks, UPLiFT improves FID and related metrics for 512→1024 text-to-image upscaling compared to CFM, while running faster. For 4× super-resolution on FacesHQ and LHQ, UPLiFT provides strong SSIM and PSNR with only two upsampling steps and latency close to simple bilinear upsampling in latent space.

Citation

If you find our work useful in your research, please consider citing:

@article{walmer2025uplift,
  title   = {UPLiFT: Efficient Pixel-Dense Feature Upsampling with Local Attenders},
  author  = {Matthew Walmer and Saksham Suri and Anirud Aggarwal and Abhinav Shrivastava},
  journal = {arXiv preprint},
  year    = {2025}
}

Note: We will update the bibliographic details (venue, year, arXiv ID) once they are finalized.