functions. The config.yaml file holds the project
configuration, such as the models and paths used.
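The contents of config.yaml are not given in the paper; a hypothetical sketch of such a file, with illustrative key names only, might look like:

```yaml
# Hypothetical config.yaml sketch -- key names are illustrative,
# not the actual file from the project.
models:
  sentence_transformer: Lajavaness/sentence-camembert-large  # from the list in Sec. 2.3
  qa_model: some-french-qa-model                             # placeholder
paths:
  documents_dir: ./data/pdf
  faiss_index_dir: ./data/index
chunking:
  max_tokens: 512   # limit of the free downloadable models
```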
Figure 1: Proposed method architecture.
2.3 Method Steps Description
The following steps are used to read PDF documents
that contain resolution notes and procedures used to
resolve problems or tickets. Once the PDF processing
is approved, other document types can follow the
same steps:
1. Read PDF documents, split them into text
chunks, and store these chunks together with their
indexes, embeddings, and metadata. Since free,
downloadable models like the ones available on
Hugging Face are limited to 512 tokens, we cannot
process all the text of a document at once. For
example, an average document contains about 16,000
characters, i.e. roughly 2,000-3,000 tokens, so the
text must be split into chunks. The splitting must
preserve full sentences, which requires an NLP
(Natural Language Processing) toolkit specialized
for the French language, such as spaCy, pysbd, or
NLTK. Models with larger context windows exist but
are impractical here:
- Mistral-7B can handle 8k tokens but needs
a powerful GPU like an A100 with 40 GB of
VRAM (Siino, M., 2024);
- Mixtral can handle 32k tokens, but it is not
downloadable and not free (Lermen, S.,
2023);
2. Using Transformers:
Use a Sentence Transformer model to index each
PDF document. The choice of a sentence transformer
rather than a standard transformer is crucial, since we
are interested in the meaning of each sentence in
context, not in a word or group of words. Some
sentence transformers for French available on
Hugging Face (https://huggingface.co/) include:
a) Lajavaness/sentence-flaubert-base,
b) Lajavaness/sentence-camembert-large,
c) paraphrase-multilingual-mpnet-base-v2,
d) paraphrase-multilingual-MiniLM-L12-v2,
e) all-MiniLM-L6-v2.
For the Question Answering task, we use FAISS
index search to find the sentences that correspond to
a question. We also use cosine similarity to find the
text chunk indexes that correspond to a question,
extract the top k sentences, and re-rank these
sentences using a transformer specialized in QA to
extract the response. We also use the OpenAI cloud
model to perform indexing and question answering.
3. Based on the tasks described above, we proceed
to process the generated PDF documents and
retrieve data according to a specific template.
4. Generate a description for each document by
performing successive summarization tasks
using a specialized transformer model.
5. Perform classification on each description to find
similar topics, using, for example: (a) softmax
and feedforward layers from Torch; or (b) FlauBERT.
6. Use the same process for QA as above, but this
time store the position of each image relative to
the text from each PDF document. When
constructing the KB based on the template, insert
the images at the same positions relative to the
text.
7. Develop a Streamlit tool that contains three tabs:
(a) Indexing documents; (b) Testing different
models and techniques; and (c) Batch, for batch
processing of a complete folder.
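The retrieval step in the QA task above can be sketched as follows. This is a minimal, self-contained illustration: the toy character-count `embed` function stands in for the actual sentence-transformer encoder (e.g. one of the French models listed in step 2), and the brute-force cosine-similarity scan stands in for the FAISS index search; the re-ranking QA transformer is omitted.

```python
import math

def embed(text):
    # Toy stand-in for a sentence-transformer encoder: a
    # bag-of-letters vector. In the real pipeline this would be
    # a French sentence-transformer model from Hugging Face.
    vec = [0.0] * 26
    for ch in text.lower():
        if 'a' <= ch <= 'z':
            vec[ord(ch) - ord('a')] += 1.0
    return vec

def cosine(u, v):
    # Cosine similarity between two vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def top_k(question, chunks, k=3):
    # Score every chunk against the question and keep the k best;
    # FAISS performs this search efficiently over stored indexes.
    q = embed(question)
    scored = sorted(
        ((cosine(q, embed(c)), i) for i, c in enumerate(chunks)),
        reverse=True,
    )
    return [chunks[i] for _, i in scored[:k]]
```

The returned top-k chunks would then be passed to a QA-specialized transformer for re-ranking and answer extraction.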
2.3.1 Text Splitting Task
The purpose of this phase is to read the text from a
PDF file, clean it by removing extra whitespace,
recurrent dots, etc., and then split it into full sentences
using an NLP toolkit like spaCy (Kumar, M., 2023),
pysbd, or NLTK (Yao, J., 2019). The sentences can
then be grouped into chunks based on the total number
of characters or the total number of words. The latter
requires some extra processing with NLP tools to
split the text into words and count them. The result
of this operation is a list of text chunks.
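This phase can be sketched as below, assuming character-based grouping. The naive regex sentence splitter is only a stand-in for a French-aware segmenter such as spaCy or pysbd; the chunk grouping, however, shows the key constraint: sentences are never cut in the middle.

```python
import re

def split_sentences(text):
    # Clean the raw PDF text, then split it into sentences.
    # The regex split is a crude stand-in for spaCy/pysbd/NLTK.
    text = re.sub(r'\s+', ' ', text)      # collapse extra whitespace
    text = re.sub(r'\.{2,}', '.', text)   # remove recurrent dots
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

def group_chunks(sentences, max_chars=512):
    # Group whole sentences into chunks of at most max_chars
    # characters, preserving full sentences in every chunk.
    chunks, current = [], ''
    for s in sentences:
        if current and len(current) + 1 + len(s) > max_chars:
            chunks.append(current)
            current = s
        else:
            current = (current + ' ' + s).strip()
    if current:
        chunks.append(current)
    return chunks
```

The output of `group_chunks(split_sentences(raw_text))` is the list of text chunks passed on to the indexing task.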
2.3.2 Indexing Task
The purpose of this task is to retrieve the embeddings
for each chunk and save them as FAISS indexes,