Remote Sensing Image Scene Classification Based on Two-Stage High-Order Transformer
2023, pages: 1-13
Online publication date: 2023-12-14
DOI: 10.11834/jrs.20233332
WU Qianqian, NI Kang and ZHENG Zhizhong. XXXX. Remote Sensing Image Scene Classification Based on Two-Stage High-Order Transformer. National Remote Sensing Bulletin, XX(XX): 1-13
The Transformer model, with its powerful global feature modeling and long-range dependency representation capabilities, has been widely applied to remote sensing image scene classification. However, remote sensing scene images pose challenges such as complex spatial structures and large variations in target scale, and directly adopting the fixed-size patch partitioning and deep feature representation of ViT (Vision Transformer) cannot effectively characterize their spatial feature information. To address these problems, this paper proposes a remote sensing image scene classification method based on a Two-stage High-order Vision Transformer (THViT). The method takes the LV-ViT-S network as its backbone and comprises a coarse-to-fine dynamic two-stage classification process: the first stage partitions the remote sensing image into larger-scale patches to classify easily recognizable scenes; the second stage re-partitions the image according to a class attention mechanism and an informative region extraction module, completing the classification of more complex scenes. To improve the discriminability of deep features, THViT introduces a Brownian covariance high-order feature representation, which effectively captures discriminative deep features of remote sensing scene images from a statistical perspective. In addition, to overcome the limitation that Transformer networks use only the classification tokens as classification features, both the classification tokens and the high-order feature tokens are fed into the softmax classifier, improving scene classification performance and verifying the effectiveness of high-order feature tokens for remote sensing image scene classification. Experimental results show that, compared with related algorithms such as CFDNN, GLDBS, GAN, GCN, D-CapsNet, SCCov, ViT, Swin-T, LV-ViT-S, and SCViT, THViT achieves superior performance on the NWPU45 (NWPU-RESISC45) and AID (Aerial Image Dataset) datasets.
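The coarse-to-fine dynamic routing described in the abstract can be illustrated with a minimal sketch: a scene is first classified from large patches, and only if the softmax confidence falls below a threshold is it re-partitioned and passed to the fine stage. The `coarse_model` and `fine_model` callables and the default threshold below are hypothetical placeholders, not the paper's exact implementation.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a logit vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

def two_stage_classify(image, coarse_model, fine_model, threshold=0.9):
    """Coarse-to-fine dynamic inference (illustrative sketch).

    coarse_model / fine_model are hypothetical callables returning class
    logits for the large-patch and re-partitioned small-patch
    tokenizations, respectively.
    """
    probs = softmax(coarse_model(image))
    if probs.max() >= threshold:        # easy scene: stop at stage one
        return int(probs.argmax()), "coarse"
    probs = softmax(fine_model(image))  # hard scene: finer partition
    return int(probs.argmax()), "fine"
```

The threshold trades accuracy for compute: a higher value sends more scenes through the expensive fine stage.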
(Objective)
Transformer has been widely used in remote sensing image scene classification because of its powerful global feature modeling and long-range dependency representation capabilities. However, remote sensing scene images pose challenges such as complex spatial structures and large variations in target scale, and directly adopting the fixed-size patch partitioning and deep feature representation of ViT (Vision Transformer) cannot effectively characterize their spatial feature information.
(Method)
To alleviate these problems, this paper proposes a remote sensing image scene classification method based on a Two-stage High-order Vision Transformer (THViT). The method takes the LV-ViT-S network as its backbone and comprises a coarse-to-fine dynamic two-stage classification process. In the first stage, the remote sensing image is partitioned into larger-scale patches to classify easily recognizable scenes. In the second stage, the image is re-partitioned according to a class attention mechanism and an informative region extraction module, completing the classification of more complex scenes; the switch between the coarse and fine stages is controlled by a threshold. Meanwhile, to improve the discriminability of deep features, THViT introduces a Brownian covariance high-order feature representation, which effectively captures discriminative deep feature representations of remote sensing scene images from a statistical perspective. Moreover, to overcome the limitation that the Transformer uses only the classification tokens as classification features, both the classification tokens and the high-order feature tokens are fed into the softmax classifier, improving classification performance and verifying the effectiveness of high-order feature tokens for remote sensing image scene classification.
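The Brownian covariance representation builds on the sample distance covariance of Székely and Rizzo (2009), cited in the references. A minimal NumPy sketch of that statistic between two sets of paired feature vectors (the function name and shapes are illustrative, not the paper's exact module):

```python
import numpy as np

def distance_covariance(x, y):
    """Squared sample distance covariance between paired observations.

    x: (n, p) array, y: (n, q) array -- n paired samples.
    Following Szekely & Rizzo (2009): double-center the pairwise
    Euclidean distance matrices, then average their elementwise product.
    """
    # Pairwise Euclidean distance matrices.
    a = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    b = np.linalg.norm(y[:, None, :] - y[None, :, :], axis=-1)
    # Double centering: subtract row and column means, add grand mean.
    A = a - a.mean(0, keepdims=True) - a.mean(1, keepdims=True) + a.mean()
    B = b - b.mean(0, keepdims=True) - b.mean(1, keepdims=True) + b.mean()
    return (A * B).mean()  # squared sample distance covariance
```

Unlike ordinary covariance, this statistic vanishes (in the population limit) only when the two variables are independent, which is what makes it attractive as a discriminative high-order descriptor.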
(Result)
Experimental results show that, compared with related algorithms such as CFDNN, GLDBS, GAN, GCN, D-CapsNet, SCCov, ViT, Swin-T, LV-ViT-S, and SCViT, THViT achieves better performance on both the NWPU-RESISC45 and AID datasets.
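The joint use of classification tokens and high-order feature tokens described in the Method can be sketched as feeding their concatenation to a softmax classifier. The fusion-by-concatenation and the weight shapes below are assumptions for illustration, not the paper's exact classification head.

```python
import numpy as np

rng = np.random.default_rng(0)

def fused_classification(cls_token, ho_token, W, b):
    """Softmax over logits computed from the concatenated class token
    and high-order feature token (illustrative sketch; shapes and the
    concatenation fusion are assumptions, not the paper's exact head).
    """
    z = np.concatenate([cls_token, ho_token]) @ W + b  # joint logits
    e = np.exp(z - z.max())
    return e / e.sum()  # class probabilities
```

Because both tokens contribute to the logits, the classifier can draw on global context (class token) and statistical texture cues (high-order token) at once.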
(Conclusion)
The results verify that the coarse-to-fine dynamic two-stage design and the high-order features achieve excellent performance in remote sensing scene classification.
remote sensing images; scene classification; Transformer network; feature representation; high-order features
Chaib S, Liu H, Gu Y F and Yao H X. 2017. Deep feature fusion for VHR remote sensing scene classification. IEEE Transactions on Geoscience and Remote Sensing, 55(8): 4775-4784 [DOI: 10.1109/TGRS.2017.2700322]
Cheng G, Han J W and Lu X Q. 2017. Remote sensing image scene classification: benchmark and state of the art. Proceedings of the IEEE, 105(10): 1865-1883 [DOI: 10.1109/JPROC.2017.2675998]
Chen M Z, Lin M B, Li K, Shen Y H, Wu Y J, Chao F and Ji R R. 2023. CF-ViT: a general coarse-to-fine method for vision transformer. Proceedings of the AAAI Conference on Artificial Intelligence, 37(6): 7042-7052
Deng P F, Huang H and Xu K J. 2020. A deep neural network combined with context features for remote sensing scene classification. IEEE Geoscience and Remote Sensing Letters, 19: 1-5 [DOI: 10.1109/LGRS.2020.3016769]
Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X H, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, Uszkoreit J and Houlsby N. 2020. An image is worth 16x16 words: transformers for image recognition at scale. International Conference on Learning Representations. arXiv: 2010.11929
He N, Fang L, Li S, Plaza J and Plaza A. 2020. Skip-connected covariance network for remote sensing scene classification. IEEE Transactions on Neural Networks and Learning Systems, 31(5): 1461-1474 [DOI: 10.1109/TNNLS.2019.2920374]
Jiang Z H, Hou Q B, Yuan L, Zhou A Q, Shi Y J, Jin X J, Wang A R and Feng J S. 2021. All tokens matter: token labeling for training better vision transformers. Advances in Neural Information Processing Systems, 34: 18590-18602
Liu Z, Lin Y T, Cao Y, Hu H, Wei Y X, Zhang Z, Lin S and Guo B N. 2021. Swin transformer: hierarchical vision transformer using shifted windows. Proceedings of the IEEE/CVF International Conference on Computer Vision, 10012-10022
Lv P Y, Wu W J, Zhong Y F, Du F and Zhang L P. 2022. SCViT: a spatial-channel feature preserving vision transformer for remote sensing image scene classification. IEEE Transactions on Geoscience and Remote Sensing, 60: 1-12 [DOI: 10.1109/TGRS.2022.3157671]
Ma A L, Yu N, Zheng Z, Zhong Y F and Zhang L P. 2022. A supervised progressive growing generative adversarial network for remote sensing image scene classification. IEEE Transactions on Geoscience and Remote Sensing, 60: 1-18 [DOI: 10.1109/TGRS.2022.3151405]
Mak D K. 2021. Exponential moving average. Trading Tactics in the Financial Market, Management for Professionals [DOI: 10.1007/978-3-030-70622-7_4]
Qian X L, Li J, Cheng G, Yao X W, Zhao S N, Chen Y B and Jiang L Y. 2018. Evaluation of the effect of feature extraction strategy on the performance of high-resolution remote sensing image scene classification. Journal of Remote Sensing, 22(5): 758-776 [DOI: 10.11834/jrs.20188015]
Raza A, Huo H, Sirajuddin S and Fang T. 2020. Diverse capsules network combining multiconvolutional layers for remote sensing image scene classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 13: 5297-5313 [DOI: 10.1109/JSTARS.2020.3021045]
Selvaraju R R, Cogswell M, Das A, Vedantam R, Parikh D and Batra D. 2017. Grad-CAM: visual explanations from deep networks via gradient-based localization. Proceedings of 2017 IEEE International Conference on Computer Vision. Venice, Italy: IEEE: 618-626 [DOI: 10.1109/ICCV.2017.74]
Székely G J and Rizzo M L. 2009. Brownian distance covariance. The Annals of Applied Statistics, 3(4): 1236-1265
Székely G J, Rizzo M L and Bakirov N K. 2007. Measuring and testing dependence by correlation of distances. The Annals of Statistics, 35(6): 2769-2794 [DOI: 10.1214/009053607000000505]
Touvron H, Cord M, Douze M, Massa F, Sablayrolles A and Jegou H. 2021. Training data-efficient image transformers & distillation through attention. Proceedings of the 38th International Conference on Machine Learning, 139: 10347-10357
Wang Y H, Hu Y X, Xu Y Z, Jiao P Y, Zhang X R and Cui H Y. 2021a. Context residual attention network for remote sensing scene classification. IEEE Geoscience and Remote Sensing Letters, 19: 1-5 [DOI: 10.1109/LGRS.2021.3117265]
Wang Y L, Huang R, Song S J, Huang Z Y and Huang G. 2021b. Not all images are worth 16x16 words: dynamic transformers for efficient image recognition. Advances in Neural Information Processing Systems, 34: 11960-11973
Wang W H, Xie E, Li X, Fan D P, Song K T, Liang D, Lu T, Luo P and Shao L. 2021c. Pyramid vision transformer: a versatile backbone for dense prediction without convolutions. Proceedings of the IEEE/CVF International Conference on Computer Vision, 568-578
Wang Q L, Xie J T, Zuo W M, Zhang L and Li P H. 2021d. Deep CNNs meet global covariance pooling: better representation and generalization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(8): 2582-2597 [DOI: 10.1109/TPAMI.2020.2974833]
Xia G S, Hu J W, Hu F, Shi B G, Bai X, Zhong Y F, Zhang L P and Lu X Q. 2017. AID: a benchmark data set for performance evaluation of aerial scene classification. IEEE Transactions on Geoscience and Remote Sensing, 55(7): 3965-3981 [DOI: 10.1109/TGRS.2017.2685945]
Xie J T, Long F, Lv J M, Wang Q L and Li P H. 2022. Joint distribution matters: deep Brownian distance covariance for few-shot classification. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 7972-7981
Xie J T, Zeng R R, Wang Q L, Zhou Z Q and Li P H. 2021. SoT: delving deeper into classification head for transformer. arXiv: 2104.10935v2
Xu K J, Deng P F and Huang H. 2022. Vision transformer: an excellent teacher for guiding small networks in remote sensing image scene classification. IEEE Transactions on Geoscience and Remote Sensing, 60: 1-15 [DOI: 10.1109/TGRS.2022.3152566]
Xu K J, Huang H and Deng P F. 2021a. Remote sensing image scene classification based on global-local dual-branch structure model. IEEE Geoscience and Remote Sensing Letters, 19: 1-5 [DOI: 10.1109/LGRS.2021.3075712]
Xu K J, Huang H, Deng P F and Li Y. 2021b. Deep feature aggregation framework driven by graph convolutional network for scene classification in remote sensing. IEEE Transactions on Neural Networks and Learning Systems, 33(10): 5751-5765 [DOI: 10.1109/TNNLS.2021.3071369]
Yang R, Pu F L, Xu Z Z, Ding C J and Xu X. 2021. DA2Net: distraction-attention-driven adversarial network for robust remote sensing image scene classification. IEEE Geoscience and Remote Sensing Letters, 19: 1-5 [DOI: 10.1109/LGRS.2021.3079248]
Zhang L F, Peng M Y, Sun X J, Cen Y and Tong Q X. 2019. Progress and bibliometric analysis of remote sensing data fusion methods (1992-2018). Journal of Remote Sensing, 23(4): 603-619 [DOI: 10.11834/jrs.20199073]
Zhao Z C, Li J Q, Luo Z, Li J and Chen C. 2020. Remote sensing image scene classification based on an enhanced attention module. IEEE Geoscience and Remote Sensing Letters, 18(11): 1926-1930 [DOI: 10.1109/LGRS.2020.3011405]