In February 2023, the DataLab Group took part in the Industrial Forum (FIIA) organized by AFIA (Association Française pour l’Intelligence Artificielle), where it presented its work on the ChatDoc project, a tool that uses a Large Language Model (LLM) to extract information from documents.
At the forum, the team discussed the progress of the ChatDoc project and the experiments carried out so far. The concept is simple: the OCR output of a document is passed to the LLM together with a request for the desired fields, and the model returns the extracted values.
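To make the idea concrete, here is a minimal sketch of how OCR text and a list of requested fields could be combined into a prompt, and how the model's answer could be read back. The function names, the prompt wording, and the JSON answer format are all assumptions made for illustration; they are not taken from the ChatDoc implementation, and the call to the model itself is deliberately left out.

```python
import json


def build_extraction_prompt(ocr_text: str, fields: list[str]) -> str:
    """Assemble a prompt asking the model for the requested fields as JSON.

    Hypothetical prompt format, not the one used by ChatDoc.
    """
    field_list = ", ".join(fields)
    return (
        "Extract the following fields from the document below and "
        f"answer with a JSON object whose keys are: {field_list}.\n"
        "Use null for any field that is not present.\n\n"
        f"Document (OCR output):\n{ocr_text}\n\nJSON:"
    )


def parse_extraction(raw_answer: str, fields: list[str]) -> dict:
    """Keep only the requested fields; anything missing or unparsable becomes None."""
    try:
        data = json.loads(raw_answer)
    except json.JSONDecodeError:
        data = {}
    if not isinstance(data, dict):
        data = {}
    return {name: data.get(name) for name in fields}


if __name__ == "__main__":
    fields = ["invoice_number", "total"]
    prompt = build_extraction_prompt("Invoice 123 Total 45 EUR", fields)
    print(prompt)
    # A made-up model answer, used here only to exercise the parser.
    print(parse_extraction('{"invoice_number": "123"}', fields))
```

Asking for a fixed JSON shape is one possible design choice: it keeps the parsing step trivial and makes missing fields explicit instead of silently absent.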
To improve results, a small model (Flan-T5) was fine-tuned on the task. The results have not yet reached the level of the best specialized models, such as DocParser, but they are promising. Interestingly, the model appears to generalize to kinds of documents that are not covered by the training data. For instance, even when data related to car accidents is excluded from the training set, the model still manages to extract that information.
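The car-accident experiment amounts to holding out one category of documents during training and evaluating on it afterwards. A minimal sketch of such a split is shown below; the `doc_type` key and the example records are hypothetical and only illustrate the idea, not the team's actual data pipeline.

```python
def split_by_held_out_type(examples: list[dict], held_out_type: str):
    """Separate examples into a training set and a held-out set by document type.

    Each example is assumed to be a dict with a hypothetical "doc_type" key.
    """
    train = [ex for ex in examples if ex["doc_type"] != held_out_type]
    held_out = [ex for ex in examples if ex["doc_type"] == held_out_type]
    return train, held_out


if __name__ == "__main__":
    # Made-up records, for illustration only.
    examples = [
        {"doc_type": "invoice", "text": "..."},
        {"doc_type": "car_accident", "text": "..."},
        {"doc_type": "payslip", "text": "..."},
    ]
    train, held_out = split_by_held_out_type(examples, "car_accident")
    print(len(train), len(held_out))
```

The model would then be fine-tuned on `train` only and scored on `held_out`, so that any correct extraction on the held-out category reflects generalization and not memorized examples.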