The DataLab Groupe has developed a new generation of information extraction models from scanned documents. This work has been initiated during the internship of Mohamed DHOUIB, and received the research internship prize during the graduation day at the Ecole Polytechnique on december 9, 2022.

Information extraction from scanned documents is a difficult task which has attracted increasing interest lately due to its usefulness in various document processing use cases.

A robust extraction model should meet some criteria:

  • Performing well on information extraction tasks
  • Be fast/efficient
  • Learning from little annotated data
  • Generalizing to new data layouts

The DataLab Groupe developed thanks to its R&D a new generation of information extraction model which meets the previous criteria and breaks free from some limitations of current generations of information extraction models.

The new approach named “DocParser” will be shared with the scientific community at the ICDAR 2023 conference.

This work has been initiated during the internship of Mohamed DHOUIB, and received the research internship prize during the graduation day at the Ecole Polytechnique on december 9, 2022.

Current information extraction approaches and their limitations

Current approaches

Current approaches for information extraction often use a third party OCR with visual information. OCR solutions allow to read all the words inside of the document. Extraction follows two steps:

  • The document is OCRized
  • Two approaches can then be used
    • A model detects the information based on visual data and the text acquired through OCR and then extracts the relevant fields
    • The model detects the fields to extract using the text acquired through OCR potentially using visual data

Limitations of current approaches

Current OCR based approaches are limited

  • They are always limited by the performances of the OCR.
  • Training requires visual and positional annotations (bounding boxes around the relevant fields) in addition to the textual values. That kind of annotation is costly.

DocParser: a new generation of information extraction

To overcom the previous limitations, the DataLab Groupe developed an innovating end-to-end information extraction approach called DocParser.

Contrary to previous approaches, DocParser does not require an OCR module and only needs textual annotations only for training. Furthermore, DocParser improves extraction performance while being faster.

The model has an encoder-decoder architecture. The input is the image from which we want to extract fields. The outputs are the values of the fields.

Here are the processing steps:

  • The encoder encodes the image into a grid containing the most important visual information,
  • The decoder seraches for the position of the field based on the grid,
  • The decoder reads the value of the field and extracts character by character.

What does that mean to the Credit Agricole?

The DataLab Groupe ensures constant improvement to its AI extraction information solution.

Through those works, the DataLab Groupe has come up with a new generation of information extraction models which performs better tha previous ones and state-of-the-art and is freed from previous limitations. It will be published in the ICDAR 2023 conference.