SaLon3R: Structure-aware Long-term Generalizable 3D Reconstruction from Unposed Images

arXiv 2025

Jiaxin Guo1,2, Tongfan Guan1, Wenzhen Dong2, Wenzhao Zheng3, Wenting Wang1, Yue Wang4, Yeung Yam1, Yun-Hui Liu1,2
The Chinese University of Hong Kong1 Hong Kong Center for Logistics Robotics2 University of California, Berkeley3 Zhejiang University4

Abstract

Recent advances in 3D Gaussian Splatting (3DGS) have enabled generalizable, on-the-fly reconstruction of sequential input views. However, existing methods often predict per-pixel Gaussians and combine the Gaussians from all views as the scene representation, leading to substantial redundancy and geometric inconsistencies in long-duration video sequences. To address this, we propose SaLon3R, a novel framework for Structure-aware, Long-term 3DGS Reconstruction. To the best of our knowledge, SaLon3R is the first online generalizable 3DGS method capable of reconstructing over 50 views at over 10 FPS while removing 50% to 90% of redundant Gaussians. Our method introduces compact anchor primitives that eliminate redundancy through differentiable saliency-aware Gaussian quantization, coupled with a 3D Point Transformer that refines anchor attributes and saliency to resolve cross-frame geometric and photometric inconsistencies. Specifically, we first leverage a 3D reconstruction backbone to predict dense per-pixel Gaussians and a saliency map encoding regional geometric complexity. Redundant Gaussians are then compressed into compact anchors, prioritizing high-complexity regions. The 3D Point Transformer subsequently learns spatial structural priors from training data to refine anchor attributes and saliency, enabling regionally adaptive Gaussian decoding for high geometric fidelity. Without known camera parameters or test-time optimization, our approach resolves artifacts and prunes redundant Gaussians in a single feed-forward pass. Experiments on multiple datasets show state-of-the-art performance on both novel view synthesis and depth estimation, with superior efficiency, robustness, and generalization for long-term generalizable 3D reconstruction. Code will be released.
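Since the code is not yet released, here is a minimal sketch (our assumption, not the authors' implementation) of what saliency-aware Gaussian quantization could look like: per-pixel Gaussian centers are snapped to a voxel grid, and all Gaussians falling into the same voxel are merged into one anchor via a saliency-weighted average. The function name `quantize_gaussians` and the packed-attribute layout are hypothetical.

```python
import numpy as np

def quantize_gaussians(means, attrs, saliency, voxel_size=0.01):
    """Compress per-pixel Gaussians into voxel anchors.

    means: (N, 3) Gaussian centers; attrs: (N, D) packed attributes
    (opacity, scale, rotation, SH, ...); saliency: (N,) per-Gaussian
    saliency. Returns (M, 3) anchor means and (M, D) anchor attributes,
    with one anchor per occupied voxel.
    """
    keys = np.floor(means / voxel_size).astype(np.int64)       # voxel index per Gaussian
    _, inverse = np.unique(keys, axis=0, return_inverse=True)  # group Gaussians by voxel
    inverse = inverse.ravel()
    m = inverse.max() + 1
    w = np.zeros(m)
    np.add.at(w, inverse, saliency)                            # total saliency per voxel
    anchor_means = np.zeros((m, 3))
    anchor_attrs = np.zeros((m, attrs.shape[1]))
    np.add.at(anchor_means, inverse, means * saliency[:, None])
    np.add.at(anchor_attrs, inverse, attrs * saliency[:, None])
    w = np.maximum(w, 1e-8)[:, None]                           # guard against zero saliency
    return anchor_means / w, anchor_attrs / w
```

Note the saliency weighting: within a voxel, more salient Gaussians dominate the anchor, which loosely mirrors the paper's idea of prioritizing regions of high geometric complexity.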

Qualitative Results of Reconstructed 3DGS

Comparison with FreeSplat

FreeSplat (w/ GT Pose)

Ours (w/o GT Pose)

Visualization of Progressive 3DGS Reconstruction

Comparison of the Rendered RGB and Depth

MVSplat (w/ GT Pose)

Pixel-Gaussian (w/ GT Pose)

Long-LRM (w/ GT Pose)

Ours (w/o GT Pose)


Pipeline

Given a stream of unposed and uncalibrated images as input, we employ a 3D reconstruction network for online generalizable Gaussian prediction, reconstructing over 50 views at over 10 FPS.
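The online loop described above can be sketched as follows. This is an assumed simplification: `predict_frame` stands in for the learned backbone and 3D Point Transformer (replaced here by synthetic random outputs), and the global map is a voxel-keyed anchor dictionary in which re-observed regions overwrite existing anchors rather than accumulating duplicates.

```python
import numpy as np

VOXEL = 0.01  # assumed quantization resolution

def predict_frame(rng, n=256):
    # Placeholder for per-pixel Gaussian + saliency prediction.
    means = rng.uniform(0.0, 0.5, size=(n, 3))
    saliency = rng.uniform(0.0, 1.0, size=n)
    return means, saliency

def fuse(global_anchors, means, saliency, voxel=VOXEL):
    """Merge one frame's Gaussians into the global anchor map, keeping the
    most salient Gaussian per voxel (a crude stand-in for the learned
    anchor refinement)."""
    keys = np.floor(means / voxel).astype(np.int64)
    for k, mu, s in zip(map(tuple, keys), means, saliency):
        if k not in global_anchors or s > global_anchors[k][0]:
            global_anchors[k] = (s, mu)
    return global_anchors

rng = np.random.default_rng(0)
anchors = {}
total = 0
for _ in range(5):                    # five overlapping "frames"
    means, sal = predict_frame(rng)
    total += len(means)
    fuse(anchors, means, sal)
print(f"{total} Gaussians -> {len(anchors)} anchors")
```

Because overlapping frames map many Gaussians to already-occupied voxels, the anchor count grows sub-linearly with the number of views, which is the source of the reported 50% to 90% redundancy removal.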

Visualization of Global Gaussian Splatting and Extrapolation Views. We zoom in on extrapolation views for a clearer comparison with FreeSplat (w/ GT pose).

Quantitative Comparisons

Qualitative Comparison on the ScanNet dataset. Given 10 context views as input, we compare novel view synthesis results for both rendered color and depth. Additional qualitative results are provided in the supplementary material.

Qualitative Results of Novel View Synthesis with Rendered RGB and Depth on the Replica and ScanNet++ datasets.

The learned saliency shows higher values and a denser distribution in regions with more complex geometry and appearance.

Novel View Evaluation and Depth Estimation Results

Ablation Study

Increasing β from 0 to 0.5 removes most redundant Gaussians and significantly improves performance. A larger β = 0.8 further removes redundant Gaussians with little degradation.
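One plausible reading of the β knob (our assumption; the page does not define it precisely) is a pruning ratio: drop the lowest-saliency fraction β of Gaussians, so β = 0 keeps everything and β = 0.8 keeps only the top 20% most salient. A minimal sketch under that interpretation, with the hypothetical helper `prune_by_saliency`:

```python
import numpy as np

def prune_by_saliency(saliency, beta):
    """Return a boolean keep-mask over Gaussians for pruning ratio beta."""
    if beta <= 0.0:
        return np.ones(saliency.shape, dtype=bool)   # beta = 0: keep all
    threshold = np.quantile(saliency, beta)          # beta-quantile cutoff
    return saliency > threshold

saliency = np.linspace(0.0, 1.0, 10)   # toy saliency values
mask = prune_by_saliency(saliency, beta=0.5)
print(mask.sum())                       # 5 of 10 Gaussians survive
```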

Ablation study on the Number of Context Views. We vary the number of input views from 10 to 200 in the office 02 scene on Replica.

Ablation study on the effects of different components on ScanNet.

Ablation study on the Resolution of Quantization. We vary the quantization resolution from 0 (no quantization) to 0.02, using 30 views in the office 02 scene on Replica.
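The trade-off this ablation probes can be illustrated with a toy experiment (random points standing in for Gaussian centers): coarser voxels merge more Gaussians into fewer anchors, and resolution 0 disables quantization entirely.

```python
import numpy as np

rng = np.random.default_rng(0)
centers = rng.uniform(0.0, 1.0, size=(5000, 3))   # synthetic Gaussian centers

def count_anchors(points, resolution):
    """Anchors remaining after snapping points to a voxel grid."""
    if resolution == 0:                            # 0 = no quantization
        return len(points)
    keys = np.floor(points / resolution).astype(np.int64)
    return np.unique(keys, axis=0).shape[0]

for res in (0, 0.005, 0.01, 0.02):
    print(res, count_anchors(centers, res))
```

The anchor count decreases monotonically with coarser resolution; the ablation's question is how far this compression can go before rendering quality degrades.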