Building extraction from remote sensing images based on multi-scale information fusion method under Transformer architecture
2023, Pages: 1-11
网络出版日期: 2023-06-25
DOI: 10.11834/jrs.20233017
LIU Yi, ZHANG Yinjie, AO Yang, JIANG Dalong and ZHANG Zhaorui. XXXX. Building extraction from remote sensing images based on multi-scale information fusion method under Transformer architecture. National Remote Sensing Bulletin, XX(XX): 1-11
Abstract: Buildings are the most common infrastructure in cities, and extracting building areas from remote sensing images is of great significance for urban planning, population estimation, and disaster assessment. Based on the Transformer architecture, this paper designs an end-to-end method for extracting building areas from remote sensing images. First, to address the information redundancy and information differences among multi-scale image features, a repeated feature pyramid structure, Tri-FPN (Triple-Feature Pyramid Network), is proposed to achieve global multi-scale information fusion across neighbouring scales, improving the class-representation consistency of multi-scale features and reducing information redundancy. Second, to address the problem that only the scale factor is considered when fusing multi-scale extraction results, a "scale-class-spatial" attention module, CSA-Module (Class-Scale Attention Module), is designed to effectively fuse building extraction results at different scales. Finally, Tri-FPN and CSA-Module are added to the Transformer structure for model training to obtain the best building extraction results. Comparative experiments show that the proposed method effectively raises the detection rate of building areas, provides more accurate building outlines, and improves building extraction accuracy from remote sensing images, achieving IOU scores of 91.53% on the WHU Building dataset and 81.7% on the INRIA dataset.
Objective: With the development of deep learning, researchers are paying increasing attention to its application to building extraction from remote sensing images. To obtain better detail and overall results, many experiments on multi-scale feature fusion, which boosts performance during the feature inference stage, and on multi-scale output fusion have been conducted to achieve a trade-off between accuracy and efficiency. However, current multi-scale feature fusion methods only consider the nearest features, which is insufficient for cross-scale feature fusion. Multi-scale output fusion is likewise limited to a unary correlation that takes only the scale element into account. To address these problems, we propose a feature fusion method and a result fusion module to improve the accuracy of building extraction from remote sensing images.
Method: This paper proposes Tri-FPN (Triple-Feature Pyramid Network) and CSA-Module (Class-Scale Attention Module), built on Segformer, to extract buildings from remote sensing images. The whole network is divided into three components: feature extraction,
feature fusion, and a classification head. In the feature extraction component, this paper adopts the Segformer structure to extract multi-scale features. Segformer uses self-attention to extract feature maps at different scales. To adaptively enlarge the receptive field, Segformer applies strided convolution kernels to shrink the key and value vectors in the self-attention computation, which significantly reduces the computational cost. In the feature fusion component,
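The cost saving from shrinking the key and value vectors can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: projections are omitted, and average pooling stands in for Segformer's strided-convolution reduction.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sr_attention(x, H, W, ratio):
    # x: (N, C) tokens flattened from an H x W feature map.
    # Key/value come from a map pooled by `ratio` in each spatial
    # dimension, so the attention matrix shrinks from N x N
    # to N x (N / ratio**2).
    N, C = x.shape
    q = x  # identity projections keep the sketch minimal
    pooled = (x.reshape(H, W, C)
               .reshape(H // ratio, ratio, W // ratio, ratio, C)
               .mean(axis=(1, 3))      # 2x2 block average per `ratio`
               .reshape(-1, C))
    k = v = pooled
    attn = softmax(q @ k.T / np.sqrt(C))  # (N, N / ratio**2)
    return attn @ v                       # back to (N, C)

rng = np.random.default_rng(0)
out = sr_attention(rng.random((64, 16)), H=8, W=8, ratio=2)
print(out.shape)  # (64, 16)
```

With `ratio=2` the attention matrix is 64×16 rather than 64×64, a 4× reduction that grows quadratically with the reduction ratio.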
the goal is to fuse multi-scale features coming from different parts of the feature extraction network. Tri-FPN consists of three feature pyramid networks. The fusion follows a "top-down", "bottom-up", "top-down" sequence, which enlarges the scale receptive field. The basic fusion blocks are a 3×3 convolution with element-wise feature addition and a 1×1 convolution with channel concatenation. This design helps maintain spatial diversity and intra-class feature consistency. In the classification head component,
each pixel is assigned a predicted label. First, the feature map passes through a 1×1 convolution to produce a coarse result. Second, the feature map is shrunk in the channel dimension by a 1×1 convolution. Third, the shrunk feature map is concatenated with the coarse result and 2× up-sampled. Fourth, the mixed feature is segmented by a 5×5 convolution. At the same time, a Height×Width×Classes attention map, which takes class information, scale diversity, and spatial details into account, is computed from the mixed feature by a 3×3 convolution block. Last, the coarse result and the mixed-feature result are fused under the attention map.
Results: A series of experiments were carried out on the WHU Building and INRIA datasets. For the WHU Building dataset,
precision reaches 95.42%, recall 96.25%, and IOU 91.53%. For the INRIA dataset, precision, recall, and IOU reach 89.33%, 91.10%, and 81.7%, respectively. Compared with the backbone, the improvements in recall and IOU both exceed 1%, which shows that the proposed method has strong feature fusion and segmentation ability.
Conclusion: Tri-FPN effectively improves building extraction accuracy and overall efficiency, especially on boundaries and holes within building areas, which verifies the validity of multi-scale feature fusion. By taking class (C), scale (S), and spatial attention into account, the CSA-Module greatly improves accuracy with a negligible number of extra parameters. By adopting both Tri-FPN and CSA-Module, the network improves the prediction of small buildings and fine details in remote sensing images.
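The precision, recall, and IOU figures reported above follow from standard confusion-matrix counts on the binary building mask; a minimal sketch:

```python
import numpy as np

def building_metrics(pred, gt):
    # precision, recall and IoU for a binary building mask
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()    # building predicted as building
    fp = np.logical_and(pred, ~gt).sum()   # background predicted as building
    fn = np.logical_and(~pred, gt).sum()   # building missed
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    iou = tp / (tp + fp + fn)
    return precision, recall, iou

pred = np.array([[1, 1], [0, 1]])
gt = np.array([[1, 0], [0, 1]])
p, r, i = building_metrics(pred, gt)
print(p, r, i)  # 0.666..., 1.0, 0.666...
```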
Keywords: remote sensing images, building extraction, deep learning, Transformer, image feature pyramid, class-scale attention