DiffusionMVS: Multi-view stereo reconstruction algorithm for remote sensing image based on diffusion constraints

Lian Yuanfeng; Wang Sen

doi:10.11834/jrs.20265119

2026, Vol. 30, No. 5, pp. 1510-1523
Received: 2025-04-03

Published in print: 2026-05-07

Lian Y F and Wang S. 2026. DiffusionMVS: Multi-view stereo reconstruction algorithm for remote sensing image based on diffusion constraints. National Remote Sensing Bulletin, 30(5): 1510-1523. DOI: 10.11834/jrs.20265119.

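Abstract

To address the low feature-matching accuracy, noisy predicted depth maps, and incomplete edge reconstruction that arise in multi-view stereo for remote sensing images, this paper proposes a diffusion-constrained multi-view stereo network, DiffusionMVS (Diffusion Multi-View Stereo). First, building on the feature pyramid network, a multi-scale feature extraction module based on feature enhancement, MFE-FPN (Multi-scale Feature Enhancement Feature Pyramid Network), is designed to strengthen the network's ability to learn features from multi-view remote sensing images. Second, an adaptive feature aggregation module, AFA (Adaptive Feature Aggregation), is proposed to dynamically integrate features from different levels and capture fine depth details at object edges. Finally, a diffusion-constrained cost volume optimization model, DCM (Diffusion Constrained Module), is designed; it removes noise from the predicted depth map by optimizing the depth value distribution at noisy points, and it is combined with an edge-guided Transformer network to improve edge reconstruction in the depth map. Experimental results show that, on the WHU-TLC and LuoJia-MVS datasets, the mean absolute error (MAE) of DiffusionMVS improves by 28.11% and 3.37%, respectively, relative to the baseline model, demonstrating good reconstruction performance and generalization ability.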

Abstract

Large-scale 3D scene reconstruction based on remote sensing images provides critical support for smart city development, map navigation, virtual reality, and digital twin systems. Existing 3D reconstruction algorithms predominantly rely on feature matching techniques and demonstrate satisfactory performance in small-scale or structurally simple scenes. In complex or large-scale environments, however, intricate terrain features and noise interference lead to significant challenges, such as suboptimal reconstruction accuracy and incomplete modeling, that hinder the effectiveness of these methods. Therefore, this study proposes a diffusion-constrained multiview stereo network comprising a multiscale feature enhancement feature pyramid network (MFE-FPN), an adaptive feature aggregation module (AFA), and a diffusion-constrained module (DCM) to address the issues of low feature-matching accuracy, high noise in predicted depth maps, and incomplete edge reconstruction in multiview stereo for remote sensing images.

The proposed method consists of several steps. First, the network takes N multiview remote sensing images as input, with the first image serving as the reference and the remaining N-1 as source images. It adopts a three-stage coarse-to-fine strategy to predict depth maps progressively. The network utilizes the MFE-FPN module to extract multiscale features from the input images, thereby generating hierarchical feature representations. Second, the top-level features from the FPN are mapped through an edge-aware network to compute edge-aware features, which are subsequently fused with the multiscale features. Third, an AFA is designed to aggregate the multiscale features, thereby forming a matching cost volume. Fourth, a diffusion constraint module is introduced to integrate cost volume features with edge-aware features. Fifth, an edge-guided transformer is employed to enhance the representation of edge details during the denoising stage. Sixth, the cost volume features are regularized and regressed to estimate depth, resulting in the final reconstructed depth map. Seventh, an edge-aware loss function is constructed during training to preserve the edge information in the predicted depth maps effectively.
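The sixth step is described only at a high level. As a minimal sketch, assuming the regression follows the soft-argmin scheme that is common in learned MVS pipelines (the abstract does not confirm this), the depth at each pixel can be taken as the expectation of the depth hypotheses under a softmax of the negated matching costs. The function names and the `[row][col][hypothesis]` layout below are illustrative assumptions, not the authors' implementation.

```python
import math


def softargmin_depth(costs, depth_hypotheses):
    """Regress one pixel's depth as an expectation over depth hypotheses.

    costs: matching cost per hypothesis (lower means a better match).
    depth_hypotheses: candidate depth values, same length as costs.
    """
    if len(costs) != len(depth_hypotheses):
        raise ValueError("costs and depth_hypotheses must have equal length")
    # Softmax over negated costs, shifted by the maximum for numerical stability.
    logits = [-c for c in costs]
    peak = max(logits)
    weights = [math.exp(value - peak) for value in logits]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Expected depth under the resulting probability distribution.
    return sum(p * d for p, d in zip(probs, depth_hypotheses))


def regress_depth_map(cost_volume, depth_hypotheses):
    """Apply soft-argmin to a cost volume indexed as [row][col][hypothesis]."""
    return [
        [softargmin_depth(pixel, depth_hypotheses) for pixel in row]
        for row in cost_volume
    ]
```

Because the output is a probability-weighted average rather than a hard arg-min, it can fall between the sampled hypotheses. This is one reason a coarse-to-fine strategy can narrow the depth range from stage to stage without losing sub-interval precision.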

Experimental results show that, compared with other methods, the DiffusionMVS network improves the mean absolute error (MAE) on the WHU-TLC and LuoJia-MVS datasets by 28.11% and 3.37%, respectively, thereby demonstrating superior reconstruction performance. In terms of inference time, however, the proposed method does not achieve the best performance because of the relatively low operational efficiency of the diffusion constraint module. Nevertheless, it achieves an optimal balance between accuracy and efficiency, thereby making it highly suitable for remote sensing stereo reconstruction tasks. The results on the self-constructed dataset of oil and gas stations verify the model's capability to reconstruct detailed geometric features. This capability benefits from the model's excellent performance in edge preservation and generalization in unseen scenarios. Moreover, ablation experiment results confirm that the proposed MFE-FPN, AFA, and DCM modules can effectively enhance the accuracy of depth map reconstruction.
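For readers unfamiliar with the metric, the following sketch shows how an MAE over valid depth pixels and a relative improvement over a baseline could be computed. It assumes that the reported percentages are relative reductions in MAE, which the abstract does not state explicitly. The function names and the toy numbers in the usage note are hypothetical and are not taken from the paper.

```python
def mean_absolute_error(pred, truth, valid):
    """MAE over pixels flagged valid; inputs are flat sequences of equal length."""
    if not (len(pred) == len(truth) == len(valid)):
        raise ValueError("pred, truth and valid must have equal length")
    # Keep only the pixels that have ground-truth depth.
    errors = [abs(p - t) for p, t, v in zip(pred, truth, valid) if v]
    if not errors:
        raise ValueError("no valid pixels to evaluate")
    return sum(errors) / len(errors)


def relative_improvement(baseline_mae, model_mae):
    """Percentage reduction in MAE relative to the baseline (assumed definition)."""
    if baseline_mae <= 0:
        raise ValueError("baseline_mae must be positive")
    return 100.0 * (baseline_mae - model_mae) / baseline_mae
```

As a toy usage example, with ground truth `[10, 20, 30, 40]`, the last pixel masked out, a baseline prediction of `[11, 18, 31, 0]`, and a model prediction of `[10.5, 19, 30.5, 0]`, the model's MAE is half the baseline's, so `relative_improvement` returns approximately 50.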

The proposed diffusion-constrained multiview stereo network significantly improves edge-processing capability and overall reconstruction accuracy through a multiscale feature enhancement module and a diffusion constraint module. Results indicate the model is well-suited for reconstructing mountains, forests, and buildings because of its superior performance on weak-texture regions and depth map denoising challenges. It effectively addresses the reduced reconstruction accuracy of remote sensing images under noise interference. Future work will explore incorporating the Segment Anything Model into the MVS framework to leverage its rich semantic information, thereby refining the matching process and further improving reconstruction efficiency and accuracy.
