YOLOLayout: Multi-Scale Cross Fusion Former for Document Layout Analysis
DOI:
https://doi.org/10.62677/IJETAA.2402106
Keywords:
Document Layout Analysis, Document Object Detection, Document Structure
Abstract
Document layout analysis (DLA) locates and classifies the layout elements of a document, such as Table, Figure, List, and Text. Deep-learning-based computer vision methods detect Text and Figure blocks well, but they remain unsatisfactory at recognizing List, Title, and Table blocks when training data is limited. To address this issue, we propose YOLOLayout, a single-stage DLA model that incorporates a Multi-Scale Shallow Visual Feature Enhancement Module (MS-SVFEM) and a Multi-Scale Cross-Feature Fusion Module (MS-CFF). The MS-SVFEM extracts multi-scale spatial information through a channel attention module, a spatial attention module, and multi-branch convolution. The MS-CFF fuses features from different levels through an attention mechanism. In experiments, YOLOLayout improves mAP over the baseline model by 2.2% and 1.5% on the PubLayNet and ISCAS-CLAD datasets, respectively.
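The abstract names the building blocks but not their internals, so the following is only a minimal NumPy sketch of the general ideas, not the paper's MS-SVFEM or MS-CFF. It assumes parameter-free sigmoid gates for channel and spatial attention and a per-pixel softmax over two feature levels for fusion; every function name here is hypothetical, and the multi-branch convolution and all learned weights are omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat):
    """Scale each channel of a (C, H, W) map by a gate from global average pooling."""
    gate = sigmoid(feat.mean(axis=(1, 2)))      # shape (C,)
    return feat * gate[:, None, None]

def spatial_attention(feat):
    """Scale each position of a (C, H, W) map by a gate from the channel-wise mean."""
    gate = sigmoid(feat.mean(axis=0))           # shape (H, W)
    return feat * gate[None, :, :]

def upsample_nearest(feat, factor):
    """Nearest-neighbour upsampling of a (C, H, W) map by an integer factor."""
    return feat.repeat(factor, axis=1).repeat(factor, axis=2)

def attention_fuse(shallow, deep):
    """Fuse two feature levels with per-pixel softmax weights over the levels.

    Assumption: both maps have the same channel count and the shallow map's
    spatial size is an integer multiple of the deep map's.
    """
    factor = shallow.shape[1] // deep.shape[1]
    deep_up = upsample_nearest(deep, factor)
    # One score per level and position, taken from the channel-wise mean.
    scores = np.stack([shallow.mean(axis=0), deep_up.mean(axis=0)])  # (2, H, W)
    scores = scores - scores.max(axis=0, keepdims=True)              # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=0, keepdims=True)
    fused = weights[0] * shallow + weights[1] * deep_up
    return fused, weights

# Toy inputs standing in for backbone features (random, for shape checking only).
rng = np.random.default_rng(0)
shallow = rng.standard_normal((8, 16, 16))
deep = rng.standard_normal((8, 8, 8))

enhanced = spatial_attention(channel_attention(shallow))
fused, weights = attention_fuse(enhanced, deep)
print(fused.shape)  # (8, 16, 16)
```

In a trained model the gates and fusion scores would come from learned layers rather than plain averages; the sketch only shows how attention weights can rescale a shallow feature map and blend it with a coarser one.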
License
Copyright (c) 2024 Zhangchi Gao, Shoubin Li (Author)
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.