A new paper has been accepted at the Workshop on Document Images and Language (DIL) of the 16th International Conference on Document Analysis and Recognition (ICDAR). The authors (Mohamed KERROUMI, Othmane SAYEM and Aymen SHABOU) propose a new multimodal approach for information extraction from scanned documents.
This approach, called VisualWordGrid, introduces a novel representation of scanned documents that simultaneously encodes textual, visual, and layout information. It builds on recent models such as Chargrid and WordGrid, incorporating the visual modality and improving robustness on small datasets while keeping inference time low.
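To make the idea concrete, here is a minimal sketch of such a grid representation, assuming word-level OCR output with pixel bounding boxes and a hypothetical `embed` function mapping each word to a fixed-size vector; the exact encoding details in the paper may differ:

```python
import numpy as np

def build_visual_word_grid(image, words, boxes, embed, emb_dim=128):
    """Build a VisualWordGrid-style tensor (a sketch, not the paper's exact code).

    image : (H, W, 3) uint8 array of the scanned page
    words : list of token strings from the OCR output
    boxes : list of (x0, y0, x1, y1) pixel bounding boxes, one per word
    embed : hypothetical callable mapping a word to a (emb_dim,) vector
    """
    h, w, _ = image.shape

    # Text + layout channels: paint each word's embedding into the
    # pixels covered by its bounding box, so position is preserved.
    text_grid = np.zeros((h, w, emb_dim), dtype=np.float32)
    for word, (x0, y0, x1, y1) in zip(words, boxes):
        text_grid[y0:y1, x0:x1] = embed(word)

    # Visual channels: the raw page image, scaled to [0, 1].
    visual_grid = image.astype(np.float32) / 255.0

    # One grid carrying all three modalities: (H, W, emb_dim + 3).
    return np.concatenate([text_grid, visual_grid], axis=-1)
```

The resulting tensor can then be processed by a fully convolutional segmentation-style network to label the regions corresponding to each field of interest, in the same spirit as the Chargrid family of models.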
This work contributes to the field of document analysis and recognition by bringing the visual modality into the information extraction task, making it one of the first approaches to offer such a comprehensive representation of scanned documents.
Link to the paper: VisualWordGrid