Efficient Use of Large Language Models for Analysis of Text Corpora
David Adamczyk 1 (https://orcid.org/0000-0002-7794-104X) and Jan Hůla 1,2 (https://orcid.org/0000-0001-7639-864X)
1 Institute for Research and Applications of Fuzzy Modeling, University of Ostrava, Ostrava, 701 03, Czech Republic
2 Czech Technical University in Prague, Prague, Czech Republic
Keywords:
Large Language Models, Embeddings, Word Representations, Sentence Classification, Efficient Pipeline.
Abstract:
In this paper, we propose an efficient approach for tracking a given phenomenon in a corpus using natural
language processing (NLP) methods. The topic of tracking phenomena in a corpus is important, especially in
the fields of sociology, psychology, and economics, which study human behavior in society. Unlike existing
approaches that rely on universal large language models (LLMs), which are computationally expensive, we
focus on using computationally less expensive methods. These methods allow for high data processing speed
while maintaining high accuracy. Our approach is inspired by the cascade approach to optimization, where
we first roughly filter out unwanted information and then gradually use more accurate models, which are
computationally more expensive. In this way, we are able to process large amounts of data with high accuracy
using different models, while also reducing the overall cost of computations. To demonstrate the proposed
method, we chose a task that consists of finding the frequency of occurrence of a certain phenomenon in a
large text corpus, which is divided into individual months of the year. In practice, this means that we can, for
example, use Internet discussions to find out how much people are discussing a particular topic. The entire
solution is presented as a pipeline, which consists of individual phases that successively process text data using
methods selected to minimize the overall cost of processing all data.
1 INTRODUCTION
In this work, we present an efficient text mining
pipeline for large text corpora. Text mining is a pro-
cess of extracting knowledge from unstructured text
data. It can be used to identify patterns and trends
in large text corpora, extract keywords and phrases,
classify text into different categories, and summarize
the content of text documents. It has a wide range
of applications, including market research, customer
sentiment analysis, and fraud detection.
With the growing popularity of Large Language
Models (LLMs), which excel in natural language pro-
cessing tasks, there is increasing demand for such applications, as new opportunities become available. However, instead of relying solely on LLMs,
our aim is to design efficient pipelines that have much
lower hardware requirements and create models spe-
cialized for a given task. Our work focuses on pro-
cessing large text corpora of a given domain and we
are interested in tracking various phenomena in these
corpora. Concretely, we are interested in tracking a
given phenomenon in time.
Such problems may arise in various fields such
as sociology, psychology, or economics, where re-
searchers may be interested in analyzing the beliefs or
opinions of a population through an analysis of online
discussions. For example, one can study how opin-
ions about a given topic change in connection with
a given event. It is clear that if an event is very im-
portant, people will talk about it online and express
their opinions. By analyzing online discussions, we
can track the opinions of people and get a better un-
derstanding of how they perceive the event.
The essence of our approach is to gradually apply
several classification methods in order to effectively balance accuracy against computational cost and speed. The task we chose to demonstrate our proposed
approach is to calculate the frequencies of a moni-
tored event in individual months. In simple terms, we
want to count how often people discuss topics such
as political elections or weather in given months. For
this purpose, we will use a data set containing com-
ments from social networks, where individual com-
ments are divided by month from 2019 to 2022. A
naive approach is to perform inference for each com-
ment using a large language model. Such an approach
is both very expensive and time-consuming. We want
to use a combination of techniques, where some ex-
cel in high speed and low computational cost but have
lower accuracy, while other techniques excel in high
accuracy but are computationally very demanding. A
cascade of these techniques forms a pipeline that first filters out irrelevant comments very cheaply and quickly, and then processes the much smaller remainder with more accurate techniques, so that the entire process is accurate and, at the same time, fast, with the lowest possible computational cost.
2 RELATED WORK
Modern quantitative text analysis in sociology is
largely based on research in mass communication that
focused on the press and political propaganda in the
1930s and 1940s (Krippendorff, 2018).
In recent years, the development of natural lan-
guage processing and machine learning has offered
new opportunities for using computers for text anal-
ysis in the social sciences. Many of these methods
are described in (Macanovic, 2022), which divides the
main approaches into several categories.
As an example of identifying phenomena in large language corpora, (Spörlein and Schlueter, 2021) used a dictionary-based method to analyze text data from YouTube comments on migration-related videos to study the development of ethnic slurs against multiple minority groups during the large influx of refugees to Germany in 2015. Topic modeling is increasingly being used in the humanities, literary studies, history, and also in political science (Törnberg and Törnberg, 2016). Many of these approaches use LDA or other Bayesian methods to infer latent topics in news articles, scientific journals, or blog posts.
There are analyses that focus directly on the use of large language models for the analysis of large amounts of data. Individual
analyses can also be thematically focused according
to the analyzed domain. The topic of global warm-
ing and the use of ChatGPT is addressed by (Biswas,
2023a). Another interesting domain could be the use
in healthcare (Biswas, 2023b), (Patel and Lam, 2023).
These analyses focus specifically on the use of Chat-
GPT.
Some users consider language models to be dis-
ruptive technologies, and their use is evident not
only in academia and education. The purpose of
the study (Törnberg, 2023) is to analyze data from
the LinkedIn network in the form of user comments
about the ChatGPT tool. This study shows that pla-
giarism, references, citations, and reviews of the liter-
ature are among the topics that raise the greatest con-
cerns among these users.
Another analysis of ChatGPT users from a group
of academics is presented in (Bukar et al., 2023). This
analysis shows a division of the academic community
into a group of enthusiasts and a group that has sig-
nificant reservations about the use of LLMs. Among
other things, it can be inferred from the comments that
a large part of the users are very impressed with Chat-
GPT’s performance and its potential to help with ac-
tivities related to research, data science, data analysis,
or writing. However, the use of this tool also raises
many ethical questions among users, especially in the
educational sector, where some activities such as writ-
ing essays may be disrupted.
The use of language models (LMs) in sociological
research is described in (Jensen et al., 2022). The au-
thors show that LMs are well-suited for tasks where
researchers are interested in a specific minority group
and cannot obtain a large amount of training data.
Their illustrative analysis contributes to the sociolog-
ical study of religion. They demonstrate that the pro-
posed method is a cheap and unobtrusive way to mea-
sure religiosity in a large population in a large geo-
graphic area. Although data annotation can be very
expensive, the described method is very affordable,
requiring only a few thousand annotations to classify
datasets of up to 10 million observations. Of note
is the high versatility of the described method. The
approach is applicable in any environment where re-
searchers want to unobtrusively measure attitudes and
beliefs, but surveys can be very expensive and high-
quality data are scarce.
Our goal is to analyze information using tools
such as LLMs, similar to the studies mentioned above.
However, we want to take a different approach than
simply using these tools for a given text analysis task.
Our contribution is the proposal of a method that de-
scribes how to perform these analyses using efficient
algorithms with much lower computational resource
requirements and overall operating costs.
3 OUR PIPELINE
Our goal is to track a given phenomenon described by
a phrase in a text corpus. To generalize the proposed
procedure and describe it in a clear and concise way,
we decided to design it in the form of a pipeline. This
allows us to fully control the data flow between the
individual pipeline components. Due to the reusabil-
ity of individual functional blocks, we decided to di-
vide the entire pipeline into three separate parts. We
will thus have a separate part for creating the dataset,
a separate part for training the model, and a separate
part for inference.
3.1 Pipeline for Creating a Dataset
1. The user selects keywords
2. We will select words that are semantically similar
to the selected keywords from the corpus
3. The resulting set of words is used to select sen-
tences from the corpus in which a word from the
set occurs.
4. We will compute embeddings for the selected sen-
tences
5. We will apply the FacilityLocation algorithm to
obtain a representative sample of data
6. We will visualize this representative sample of
data using the Atlas library for further analysis of
these data
7. We will create a dataset from the selected sen-
tences using LabelStudio, Atlas, or automatically
using the OpenAI API.
3.2 Pipeline for Training a Model
1. We will fine-tune the Flan-T5 model on the
dataset created by the previous pipeline
2. We will perform model distillation of the trained
Flan-T5 model to a smaller model
3.3 Pipeline for Tracking a Given
Phenomenon in a Corpus
1. We need to select all sentences that contain a key-
word or a word with a similar meaning
2. We will perform the classification of the selected
sentences using a distilled (small) model
3. We will compare the confidence scores. If the
confidence score for a given sentence is low, we
will perform the classification using the large
Flan-T5 model.
3.4 Creating a Dataset
We need to create a dataset that will be used to train
a model, which will then be used to track a phe-
nomenon of our choice in the corpus. This requires us
to first select all potentially relevant sentences from
the corpus, which can be easily done by identifying
words that describe the phenomenon and then select-
ing sentences that contain those words. This is the
most crude method of filtering the data and may omit
relevant sentences. Nevertheless, the user may include as many words as they want and thereby control the recall. We expect the user to be able to deter-
mine which keywords are of interest to them. Because
the corpus may contain similar words with the same
meaning, we first want to determine a complete set of
words on the basis of which we will filter sentences.
We will determine the complete set of words using se-
mantic similarity with the use of vector embeddings.
3.4.1 Word Selection
First, we identify the individual words that describe
the phenomenon to be tracked. The number of these
words may vary depending on the context of the phe-
nomenon, for example, from 5 to 20. Because there
are often words that are similar or have the same
meaning, we want to find such words in the corpus
and take them into account. To work effectively with
words from a semantic perspective, it is necessary to
choose a suitable representation of these words. A
commonly used approach is to convert words into
their vector representation. These vectors are de-
signed to capture hidden information about the lan-
guage or analogies between words and the semantic
meaning of these words. The FastText library has
pre-trained models for 157 languages (Grave et al.,
2018). These models are freely available online, so
they are suitable for our use. We will extract only
unique words from our corpus and convert them to
their vector representation using the FastText library.
In this step, we obtain a vector representation of
all unique words from the given corpus, as well as a
vector representation of the words we selected. To
select words from the corpus that are similar to the
words we selected, we will use these vectors and cal-
culate the cosine distance between the selected words
and the words from the corpus. In this way, we select the n most semantically similar words for each of the selected words using the cosine distance; in our case, n = 10. This can be done inexpensively using libraries such as HNSWLIB, which is based on
the Hierarchical Navigable Small World (HNSW) al-
gorithm (Malkov and Yashunin, 2018), which facili-
tates a fast approximate search for nearest neighbors.
By creating a multilayer structure of navigable small-
world graphs, HNSWLIB can quickly traverse data
and identify the nearest data points, which is espe-
cially useful for applications with large datasets where
speed and accuracy are critical.
The uppermost layers contain sparse data, allow-
ing rapid navigation over large distances, while the
lower layers become increasingly dense, allowing for
finer navigation. When a search query is initiated,
the algorithm begins from the top layer and traverses
down, focusing on the approximate nearest neighbors
in an efficient manner. With the help of HNSWLIB,
we can retrieve semantically relevant words in a few
milliseconds.
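As an illustration of this step, the following sketch (ours, not the authors' code) expands two user-chosen keywords with their nearest corpus neighbors, assuming the pre-trained English FastText vectors (cc.en.300.bin) and the hnswlib Python bindings; the tiny corpus vocabulary and the helper name similar_words are illustrative, while n = 10 follows the text above.

```python
import fasttext
import hnswlib
import numpy as np

# Pre-trained FastText vectors (e.g. the official cc.en.300.bin model).
ft = fasttext.load_model("cc.en.300.bin")

# Unique words extracted from the corpus (a tiny illustrative sample here).
corpus_words = ["ballot", "voting", "poll", "senate", "sunny", "rainfall"]
vectors = np.array([ft.get_word_vector(w) for w in corpus_words], dtype=np.float32)

# Build an HNSW index over the corpus vocabulary using the cosine space.
dim = vectors.shape[1]
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=len(corpus_words), ef_construction=200, M=16)
index.add_items(vectors, np.arange(len(corpus_words)))
index.set_ef(50)  # query-time accuracy/speed trade-off

def similar_words(keyword, n=10):
    """Return the n corpus words closest to the keyword in FastText space."""
    query = ft.get_word_vector(keyword).astype(np.float32)
    k = min(n, len(corpus_words))
    labels, distances = index.knn_query(query, k=k)
    return [corpus_words[i] for i in labels[0]]

expanded = {kw: similar_words(kw) for kw in ["election", "vote"]}
print(expanded)
```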
3.4.2 Selection of Training Sentences
The purpose of this step is to filter out sentences that
are not relevant to our target task. We have a set of
selected keywords, and the fastest solution is to se-
lect only those sentences that contain at least one of
the selected keywords. Due to such a rough filtering,
which is also very fast, we can only work with rele-
vant data in the next steps, which significantly affects
the speed of the entire pipeline.
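A minimal sketch of this keyword filter follows, assuming a simple token-level match (the exact matching rule is not specified in the paper); the sample sentences and keywords are illustrative.

```python
import re

def filter_sentences(sentences, keywords):
    """Keep only sentences that contain at least one of the expanded keywords."""
    keyword_set = {k.lower() for k in keywords}
    selected = []
    for sentence in sentences:
        tokens = set(re.findall(r"[a-z']+", sentence.lower()))
        if tokens & keyword_set:  # cheap set-intersection test
            selected.append(sentence)
    return selected

sentences = [
    "The election results surprised everyone.",
    "I had pasta for dinner.",
    "Did you vote yesterday?",
]
print(filter_sentences(sentences, ["election", "vote", "ballot"]))
# -> the first and third sentences are kept
```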
From the perspective of our main task, which is to
calculate the frequency of occurrence of a given phe-
nomenon in individual months, we are currently able
to go through selected sentences that contain selected
words, but we are not able to accurately determine
whether these sentences refer to our phenomenon.
This is because a single word can appear in differ-
ent contexts and thus change the meaning of the sen-
tence. Therefore, we would like to further identify
only those sentences that semantically correspond to
this phenomenon. A possible solution is to use a clas-
sifier. We want to train this classifier to accurately
identify our phenomenon, thus further removing irrel-
evant sentences that contain the selected words but are
in a different context. In the following steps, we want
to fully focus on assembling the dataset and training
the model.
As can be seen in Figure 1, the embeddings
form clusters based on their semantic similarity. We
would like to create a training dataset from sen-
tences that well represent these clusters. Therefore,
we select representative sentences in such a way that
their neighborhoods cover most of the embedded sen-
tences. This can be formalized as a facility location
problem that can be approximately solved with algo-
rithms for submodular optimization.
Submodular optimization is a mathematical
framework for finding the best subset of a set of el-
ements, such as data points, features, or tasks. Algo-
rithms for submodular optimization can be accessed
from Python libraries such as Apricot (Schreiber
et al., 2020). Apricot provides a variety of algorithms
for submodular optimization, including greedy algo-
rithms, local search algorithms, and approximation
algorithms. It also includes a number of features to
visualize and evaluate the results of submodular opti-
mization.
Facility location functions (Krause and Golovin,
2014) are a category of submodular functions that,
when maximized, select data points that aptly repre-
sent the entire data space. The principle behind the fa-
cility location function is the optimization of pairwise
similarities between the data points and their closest
selected point. It is imperative for the similarity met-
ric, although user-specified, to remain non-negative;
a higher value suggests increased similarity. Optimiz-
ing a facility location function can be seen as a more
greedy variant of the k-medoids algorithm. After the
initial data points are selected, subsequent selections
tend to be in the cluster centers. This function, simi-
lar to many graph-based functions, operates on a pair-
wise similarity matrix. It progressively selects points that best cover the examples whose nearest already-selected point is still quite dissimilar. In other words, the data points chosen in subsequent stages tend to represent the less well-represented sections of the
data. The mathematical representation of a facility lo-
cation function is:
f(X, Y) = \sum_{y \in Y} \max_{x \in X} \phi(x, y)    (1)

In Equation (1), f represents the function, X denotes a subset, Y signifies the ground set, and φ embodies the similarity measure between two points.
In our case, the ground set Y is the set of all sentences and the subset X is the set of sentences that we want to use for the training dataset. Therefore, we want to maximize the function f(X, Y) over different choices of X. The similarity of sentences φ can be measured using the cosine similarity between the embeddings of sentences. To create the embeddings, we use the pre-trained model all-mpnet-base-v2 (Song et al., 2020) from the Sentence Transformers library (Reimers and Gurevych, 2019).
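To make this step concrete, the following sketch (ours, not the authors' code) selects representative sentences with the apricot library's FacilityLocationSelection, assuming a precomputed cosine-similarity matrix over all-mpnet-base-v2 embeddings; the tiny sentence list and the budget of two representatives are purely illustrative, whereas Section 3.4.3 works with a few hundred sentences.

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from apricot import FacilityLocationSelection

# A tiny stand-in for the keyword-filtered sentences; in practice this list
# contains all sentences selected in the previous step.
relevant_sentences = [
    "The election was held in November.",
    "People voted in record numbers.",
    "It rained the whole weekend.",
    "The weather was unusually hot today.",
]

encoder = SentenceTransformer("all-mpnet-base-v2")
embeddings = encoder.encode(relevant_sentences, normalize_embeddings=True)

# Pairwise cosine similarities, clipped to stay non-negative, serve as phi(x, y).
similarity = np.clip(cosine_similarity(embeddings), 0.0, None)

# Greedily maximize the facility location objective to pick representatives.
selector = FacilityLocationSelection(2, metric="precomputed")
selector.fit(similarity)
representatives = [relevant_sentences[i] for i in selector.ranking]
print(representatives)
```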
3.4.3 Data Labeling
Once we have a representative sample of sentences,
we need to label them so we can use the sentences
to fine-tune a pre-trained classifier. In Section 4, we
will demonstrate our pipeline on two examples used
to check whether it produces expected results. In
one of these examples, we track the occurrence of
sentences that express an opinion about the current
weather. Therefore, we need the classifier to distin-
guish whether a given sentence/comment is about the
weather or not. The proposed labels can thus have
values of Weather or Irrelevant. For a more detailed
division, we can choose labels such as Hot, Cold, Ir-
relevant, thereby teaching the model to also distin-
guish whether a given sentence/comment implies that
the temperature is high/low. We used three different
ways to label the sentences. Given the available tech-
nologies, we would like to give users the maximum
possible control over the assignment of labels to indi-
vidual sentences, but at the same time, we want to fo-
cus on an appropriate level of automation. Therefore,
it seems appropriate to allow for the use of multiple
approaches. The most expensive way to label is to
manually assign a label to each sentence based on the
judgment of an expert user. This can be supplemented
by other approaches that allow for partial automation,
such as finding clusters and labeling whole clusters
instead of individual examples. Finally, we have the
most automatic way to assign labels to sentences, for
example using the OpenAI API.
Manual Labeling. This kind of labeling gives the
user maximum control over the quality of the labels.
There is a wide range of software available for man-
ual labeling, and some of them allow the use of pre-
trained models, which can significantly simplify the
task. In our case, we used LabelStudio, in which we
uploaded a pre-prepared dataset containing 250 se-
lected sentences and assigned the appropriate class to
each sentence after reading it.
Labeling with OpenAI API. To maximize the au-
tomation of labeling, one can also use one of the ex-
isting commercial LLMs. In our case, we used the
OpenAI API to label another 200 selected sentences.
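A hedged sketch of such automated labeling is shown below, assuming the current openai Python client; the model name, prompt wording, and label set (taken from the weather experiment) are illustrative, since the paper does not specify the exact request used.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
LABELS = ["Hot", "Cold", "Irrelevant"]

def label_sentence(sentence):
    """Ask the API to assign one of the predefined labels to a sentence."""
    prompt = (
        "Classify the following sentence as Hot, Cold, or Irrelevant "
        "with respect to weather. Answer with a single word.\n\n"
        f"Sentence: {sentence}"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    answer = response.choices[0].message.content.strip()
    return answer if answer in LABELS else "Irrelevant"

print(label_sentence("It is freezing outside this morning."))
```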
Labeling with Atlas. As a result of the previous
step described in Section 3.4.2, we have access to
sentence embeddings and their corresponding sen-
tences. We would like to analyze these data and po-
tentially perform topic modeling, clustering, and se-
mantic search on them. Individual sentences that form
clusters have high semantic similarity. This informa-
tion can be used for dataset construction, as it will
facilitate labeling and allow us to easily label all sen-
tences from a single cluster with a single label. In
Figure 1, you can see the individual embedding clusters that were obtained as part of the experiment in Section 4.1.
This visualization was generated using the Nom-
icAI Atlas tool, which is used for the analysis and
visualization of multidimensional vector representa-
tions in a two-dimensional space and provides tools
suitable for the creation of datasets.
3.5 Training a Model
The model is the core of our entire pipeline, so it is
crucial to choose the right architecture that meets the
requirements of the tasks the model will be process-
ing. It is also essential to assemble a suitable dataset
on which the model will be trained.
Figure 1: The sentence clusters utilized in the experiment in Section 4.1.
The Flan-T5 model was published in (Chung et al., 2022). It is an instruction-finetuned version of the T5 model (Raffel et al., 2020), a popular encoder-decoder LLM. This extension was created by applying the Flan finetuning procedure (Wei et al., 2021), which means that the model is trained on a mixture of tasks, including text generation, translation, summarization, and question answering. As a result, it is able to perform a variety of NLP tasks without explicit prior training on the given task, or it is able to perform tasks from a few examples by in-context learning.
3.5.1 Low-Rank Adaptation
Currently, the most common approach to training
Large Language Models (LLMs) is to pre-train them
on a general data domain and then adapt the model
to a specific domain or task. With the rise of large
LLMs, it is very computationally expensive to per-
form full fine-tuning of all the network parameters.
A technique called Low-Rank Adaptation (LoRA)
(Hu et al., 2021) allows us to freeze the pre-trained
weights of the model and then inject trainable rank
decomposition matrices into each layer of the Trans-
former architecture. This allows us to very effectively
minimize the number of trainable parameters for a
given task.
Advantages:
- The pre-trained model can be shared and used to assemble many other small LoRA modules for different tasks. We can freeze the shared model and effectively switch between different tasks simply by replacing the low-rank matrices A and B, whose product forms the injected weight update. This also has a significant effect on the utilization of computational resources and disk space.
- The training itself is also much more efficient, because LoRA does not need to compute gradients or maintain the state of the optimizer for a large number of parameters. Instead, it only optimizes the injected low-rank matrices, which are much smaller.
A Flan-T5 XXL model with 11B parameters was
selected for training. To train with the LoRA method,
we used the Peft library (Mangrulkar et al., 2022).
Due to the size of the model, it was necessary to use
the ability to split the model across multiple GPUs
using the Accelerate library (Gugger et al., 2022).
For each experiment, the model was trained with
the prompts described in Section 3.5.2. The model
distillation process is then described in Section 3.5.3.
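The following sketch shows one way to attach LoRA adapters to a Flan-T5 checkpoint with the Peft library; the rank, scaling, dropout, and target modules are illustrative choices rather than the exact values used in the paper, and a small checkpoint is loaded only to keep the example light.

```python
from transformers import AutoModelForSeq2SeqLM
from peft import LoraConfig, TaskType, get_peft_model

# Load the pre-trained Flan-T5 backbone (the paper uses the XXL variant;
# a smaller size is shown here to keep the sketch light).
base_model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

# LoRA configuration: inject low-rank matrices A and B into the attention
# query/value projections; r, alpha, and dropout are illustrative values.
lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q", "v"],
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```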
3.5.2 Prompt Construction
The way we communicate with large language mod-
els (LLMs) is by providing a specific instruction
based on our requirements for the output. This in-
struction is called a prompt. By creating a good
prompt, we can guide the LLM to generate content
that meets our specific requirements, reduce training
costs, improve productivity, and ensure good perfor-
mance. In other words, the prompt is the key to get-
ting the most out of LLMs. By taking the time to write
a clear and concise prompt, we can help the LLM to
understand what we want and generate the best possi-
ble output.
Each prompt represents an instruction (Zhang et al., 2023) consisting of three parts: the instruction, which specifies the given task in natural language; the input text {input_text}, which in our case represents a given sentence or comment from the corpus; and the expected output, which can be, for example, a set of possible classes into which the input sentence or comment can be classified.
The prompts shown in Figure 2 were used in the individual experiments for training and inference with the Flan-T5 model.
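As an illustration of how such a prompt can be filled in and passed to Flan-T5 with the Transformers library, the sketch below uses the election prompt from Figure 2; the base-size checkpoint is chosen only to keep the example small, whereas the paper uses the XXL variant.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# A smaller checkpoint keeps the sketch lightweight; the paper uses flan-t5-xxl.
model_name = "google/flan-t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

TEMPLATE = (
    "text: {input_text} Is this text about political election?\n"
    "Options:\n-Yes\n-No"
)

def classify(sentence):
    """Fill the prompt template and let Flan-T5 generate the chosen option."""
    prompt = TEMPLATE.format(input_text=sentence)
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=5)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

print(classify("I will definitely vote in the upcoming election."))
```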
3.5.3 Model Distillation
Model distillation is a machine learning technique to
transfer the knowledge learned by a large and com-
plex model to a smaller and simpler model. This is
typically done by training the smaller model to pre-
dict the outputs of the larger model. Model distillation
text: {input_text} Is this text about
political election?
Options:
-Yes
-No
text: {input_text}
Does the text imply a hot weather,
cold weather or is not about weather.
Options:
-Hot
-Cold
-Irrelevant
Figure 2: Examples of used prompts: Top: prompt for an
experiment tracking political election discussion, Bottom:
prompt for an experiment tracking weather discussion.
has several advantages, including reducing the size
and complexity of large models, improving the per-
formance of small models, and increasing the robust-
ness of models to noise and adversarial attacks. How-
ever, model distillation also has some disadvantages,
such as being computationally expensive to train, es-
pecially for large and complex models, and distilled
models can be less robust than the original models
and may not perform as well on tasks that were not ex-
plicitly considered during training. In general, model
distillation is a powerful technique that can be used
to improve the performance, reduce the size, and in-
crease the robustness of machine learning models.
Methods such as parameter-efficient fine tuning
(PEFT) and pattern-exploiting training (PET) achieve
very good results in the training of Large Language
Models (LLMs). However, they are difficult to use
because they are exposed to high variability in man-
ually created prompts and typically require billion-
parameter language models to achieve high accuracy.
These drawbacks are addressed by an efficient and
prompt-free framework for few-shot fine-tuning of
Sentence Transformers (ST), called Sentence Trans-
former Fine-Tuning (SetFit) (Tunstall et al., 2022).
This simple framework requires no prompts for train-
ing or inference, and achieves high accuracy with an
order of magnitude fewer parameters and an order of
magnitude shorter training time.
As mentioned above, the technique is built on top
of models from the Sentence Transformer family, so
it is advisable to use a model that is supported within
this framework directly. A comparison in (Reimers and Gurevych, 2019) shows that the all-mpnet-base-v2 model is rated best in terms of quality.
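A sketch of training the small student on labeled sentences with the setfit library follows, assuming its SetFitTrainer interface (newer releases expose an equivalent Trainer class); the example sentences, integer labels, and hyperparameters are illustrative, and in the distillation setting the labels would come from the fine-tuned Flan-T5 teacher.

```python
from datasets import Dataset
from setfit import SetFitModel, SetFitTrainer

# Training sentences with labels produced by the larger model (illustrative sample).
train_ds = Dataset.from_dict({
    "text": [
        "It is scorching outside today.",
        "I can't feel my fingers, it's so cold.",
        "The new phone was released yesterday.",
    ],
    "label": [0, 1, 2],  # 0 = Hot, 1 = Cold, 2 = Irrelevant
})

student = SetFitModel.from_pretrained("sentence-transformers/all-mpnet-base-v2")

trainer = SetFitTrainer(
    model=student,
    train_dataset=train_ds,
    num_iterations=20,  # number of contrastive pairs generated per example
    num_epochs=1,
)
trainer.train()

# The trained student returns class probabilities used later as confidences.
probs = student.predict_proba(["The heat today is unbearable."])
print(probs)
```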
3.6 Pipeline for Tracking Phenomenon
This phase requires processing the entire corpus, so
we want it to be as fast as possible. It is advisable to
process the corpus at several levels of granularity. The
first step is to select only relevant sentences from the
entire corpus. This process is identical to the selection
of sentences from the first pipeline. This will signif-
icantly reduce the number of sentences for which we
will perform inference. We then classify the selected
sentences as the second step using a distilled model.
The third step is to correct for those predictions
for which the distilled model is not confident. Here, it is crucial to choose a suitable threshold against which we compare the confidence that the model returns for each predicted sentence. For each task, it is necessary to analyze all the predictions and determine the threshold value based on this analysis. With an appropriately chosen threshold, we only run the large model on those sentences for which it is truly necessary, and we assume that there will be few of these sentences.
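A minimal sketch of this confidence-gated cascade is shown below; the 0.98 threshold matches the value reported in Section 5, while the student model and the classify_with_flan_t5 fallback are placeholders for the SetFit and Flan-T5 classifiers described above.

```python
import numpy as np

CONFIDENCE_THRESHOLD = 0.98  # value used in Section 5

def classify_cascade(sentences, student, classify_with_flan_t5):
    """Fast SetFit pass first; fall back to Flan-T5 for low-confidence sentences."""
    probs = np.asarray(student.predict_proba(sentences))
    labels = probs.argmax(axis=1)
    confidences = probs.max(axis=1)

    results = []
    for sentence, label, confidence in zip(sentences, labels, confidences):
        if confidence >= CONFIDENCE_THRESHOLD:
            results.append(int(label))                       # trust the cheap model
        else:
            results.append(classify_with_flan_t5(sentence))  # expensive fallback
    return results
```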
4 EXPERIMENTS
To demonstrate the results of our approach, we per-
formed two experiments. We selected experiments
that follow discussions about weather and political
elections because the results of these experiments are
easily verifiable. Both experiments were built on the
same pipeline, with the same models, and on the same
dataset. The dataset contains sentences from social
media collected from 2019 to 2022. The sentences
are also divided into individual months in each year.
On average, there are 4.5 million sentences available
per month. The entire dataset for all years contains
over 200 million sentences. The specific number of
sentences for individual years can be found in Table
1.
Table 1: Dataset sentences per year.
Year Sentences
2019 54442253
2020 54715529
2021 54741599
2022 54456946
4.1 Political Elections
A use case for this experiment is that a user wants to
find out how much people comment about political
elections on social media in different months. This
is the simplest version of the experiment, where we
only look for sentences that are related to political
elections. On the other hand, we want to exclude all
sentences that are not related to political elections. In
essence, we want to verify the functionality of our
proposed approach in this experiment. For this ex-
periment, we have selected the keywords "election" and "vote", which, together with semantically similar words, we expect to be among the most common words in comments about political elections.
Table 2: FastText based filtering for political election exper-
iment.
Year 2019 2020 2021 2022
Time 15.72s 15.24s 18.0s 16.44s
Sentences 42332 58949 34493 32412
Table 3: SetFit based filtering performance.
Year 2019 2020 2021 2022
Time 50.80s 68.75s 43.51s 38.93s
Information about the performance of the individual pipeline components can be found in Table 2, which shows how many sentences the FastText-based filtering selected based on the given requirements and how long the filtering took. The times for classifying the selected sentences with the SetFit model are shown in Table 3.
4.2 Discussion About Weather
Table 4: FastText based filtering.
Year 2019 2020 2021 2022
Time 79.8s 82.56s 83.28s 89.28s
Sentences 161747 152994 157420 161617
Table 5: SetFit based filtering performance.
Year 2019 2020 2021 2022
Time 219.54s 212.42s 215.45s 216.41s
In the second experiment, we would like to analyze
user discussions about the weather in relation to the
individual months of the year. We expect the discussion to change over the course of the year: in hot months, discussion about hot weather should be more frequent than discussion about cold weather, and in cold months the opposite should hold. For this experiment, we selected the follow-
ing keywords: heatwave, sunny, wildfire,
muggy, wet, windy, tornado, cloudy, rain,
rainy, weather, beach.
5 RESULTS
We verified the above experiments on the dataset described in Section 4. In practice, this involves determining the frequency of occurrence of the observed phenomenon in a corpus of text divided into 12 months for the years 2019 to 2022. For the tracking of political elections, the frequency of occurrence of such a discussion can be seen in Figures 3, 4, 5, and 6. Since the original data source is the social network Reddit and the majority of the discussants come from the United States, we can relate the observed peaks to the following events:
The 2020 United States Presidential Election.
Joe Biden defeated Donald Trump and became the
46th president of the United States.
The 2022 United States House of Represen-
tatives Elections. Republicans won a major-
ity in the House of Representatives, ending the Democrats’ two-year majority.
The 2022 United States Senate Elections.
Democrats maintained a narrow majority in the
Senate.
Specific events may be relevant for specific
months:
May 2020. In this month, the 2020 United States
presidential primary elections took place.
September 2020. In this month, the presiden-
tial debates between Joe Biden and Donald Trump
took place.
November 2020. The 2020 United States presi-
dential election took place in this month.
February 2022. The 2022 United States House
of Representatives elections took place in this
month.
April 2022. In this month, the 2022 United States
Senate primary elections took place.
November 2022. The 2022 United States Senate
elections took place in this month.
The results of the second experiment, which is described in Section 4.2, can be seen in Figures 7, 8, 9, and 10.
This experiment was about people’s discussions of
hot summer weather and cold winter weather. The
graphs show that the frequency of these discussions
corresponds to the seasons of the year. In the winter,
people discuss cold weather more often, and in the
summer, they discuss hot weather more often.
Next, we summarize our experiments in terms of
performance, which we believe is the biggest bene-
fit of the proposed approach. All experiments were
conducted on an NVIDIA DGX-1 machine. Due to
the size of the Flan-T5 model in the XXL version,
Figure 3: Results for political election discussion, year
2019.
Figure 4: Results for political election discussion, year
2020.
Figure 5: Results for political election discussion, year
2021.
which uses 11B parameters, all 8 graphics cards were
used. One graphics card was used to work with the
SetFit model. First, we would like to emphasize that
the dataset used, described in Section 4, contains a total of 218,356,327 sentences. For the prediction of
the Flan-T5 model, we achieved an average time of
Figure 6: Results for political election discussion, year
2022.
Figure 7: Results for discussion about the weather, year
2019.
Figure 8: Results for discussion about the weather, year
2020.
0.2052 seconds, and for the prediction of the SetFit
model, we achieved an average time of 0.0011 sec-
onds. If we were to analyze the entire dataset using
only the Flan-T5 model, we would reach a total time
of 12,446 hours, and with the SetFit model, a total
Figure 9: Results for discussion about the weather, year
2021.
Figure 10: Results for discussion about the weather, year
2022.
time of 67 hours. This illustrates the strength of the proposed solution: we first select the relevant sentences, which are orders of magnitude fewer, using a very fast filtering process. We then classify this set with the very fast SetFit model. For sentences where the confidence of the SetFit model is less than 0.98, we refine the prediction using the slower Flan-T5 model. Thanks to this, the overall analysis can be very fast, as can be seen in Table 6.
Table 6: Average filtering performance in minutes.
Experiment FastText SetFit Flan-T5
Political Election 1.09m 3.08m 230.07m
Weather 5.58m 11.61m 671.93m
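As a sanity check, the totals quoted above follow directly from the per-sentence averages and the corpus size of 218,356,327 sentences (rounded):

218{,}356{,}327 \times 0.2052\,\text{s} \approx 4.48 \times 10^{7}\,\text{s} \approx 12{,}446\,\text{h}
218{,}356{,}327 \times 0.0011\,\text{s} \approx 2.40 \times 10^{5}\,\text{s} \approx 67\,\text{h}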
6 CONCLUSIONS
In this contribution, we presented a novel pipeline de-
sign for the analysis of extensive corpora, with an
emphasis on tracking specific phenomena. Our find-
ings underscore the pivotal role of data preprocess-
ing, highlighting its impact on subsequent stages, es-
pecially when interfaced with LLMs. Our incremen-
tal approach offers several advantages, including flex-
ibility and the ability to adapt individual steps to align
with specific user goals. This ensures that users can
effectively sift through vast information sources, dif-
ferentiating between pertinent and non-pertinent data,
ultimately facilitating a more focused investigation of the phenomenon of interest.
However, we also acknowledge certain limitations
inherent in our approach. A significant challenge
is the preparatory phase of data curation and model
training, which we aim to alleviate in future iterations
through enhanced automation of the pipeline. Addi-
tionally, the Flan-T5 model’s intensive hardware de-
mands underscore the time-intensive nature of analyz-
ing sizable corpora.
In spite of these challenges, we believe that our
approach has several advantages. The way the pipeline is designed allows users to customize the solution: they can decide for themselves how accurate the results need to be at each step and what computational cost each step incurs. Thanks to this flexibility, users can optimize the task according to their requirements, either for computational cost and speed or for overall accuracy.
ACKNOWLEDGEMENTS
The work is partially supported by grant SGS13/PřF-MF/2023. This work was supported by the Ministry of Education, Youth and Sports of the Czech Republic through the e-INFRA CZ (ID:90254).
REFERENCES
Biswas, S. S. (2023a). Potential use of chat gpt in
global warming. Annals of biomedical engineering,
51(6):1126–1127.
Biswas, S. S. (2023b). Role of chat gpt in public health.
Annals of biomedical engineering, 51(5):868–869.
Bukar, U., Sayeed, M. S., Razak, S. F. A., Yogarayan, S.,
and Amodu, O. A. (2023). Text analysis of chatgpt as
a tool for academic progress or exploitation. Available
at SSRN 4381394.
Chung, H. W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fe-
dus, W., Li, E., Wang, X., Dehghani, M., Brahma, S.,
et al. (2022). Scaling instruction-finetuned language
models. arXiv preprint arXiv:2210.11416.
Grave, E., Bojanowski, P., Gupta, P., Joulin, A., and
Mikolov, T. (2018). Learning word vectors for
157 languages. In Proceedings of the International
Conference on Language Resources and Evaluation
(LREC 2018).
Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang,
S., Wang, L., and Chen, W. (2021). Lora: Low-rank
adaptation of large language models. arXiv preprint
arXiv:2106.09685.
Jensen, J. L., Karell, D., Tanigawa-Lau, C., Habash, N.,
Oudah, M., and Fairus Shofia Fani, D. (2022). Lan-
guage models in sociological research: an application
to classifying large administrative data and measuring
religiosity. Sociological Methodology, 52(1):30–52.
Krause, A. and Golovin, D. (2014). Submodular function
maximization. Tractability, 3(71-104):3.
Krippendorff, K. (2018). Content analysis: An introduction
to its methodology. Sage publications.
Macanovic, A. (2022). Text mining for social science–the
state and the future of computational text analysis in
sociology. Social Science Research, 108:102784.
Malkov, Y. A. and Yashunin, D. A. (2018). Efficient and
robust approximate nearest neighbor search using hi-
erarchical navigable small world graphs. IEEE trans-
actions on pattern analysis and machine intelligence,
42(4):824–836.
Mangrulkar, S., Gugger, S., Debut, L., Belkada, Y., and Paul, S. (2022). Peft: State-of-the-art parameter-efficient fine-tuning methods. https://github.com/huggingface/peft.
Patel, S. B. and Lam, K. (2023). Chatgpt: the future of
discharge summaries? The Lancet Digital Health,
5(3):e107–e108.
Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S.,
Matena, M., Zhou, Y., Li, W., and Liu, P. J. (2020).
Exploring the limits of transfer learning with a uni-
fied text-to-text transformer. The Journal of Machine
Learning Research, 21(1):5485–5551.
Reimers, N. and Gurevych, I. (2019). Sentence-bert: Sen-
tence embeddings using siamese bert-networks. In
Proceedings of the 2019 Conference on Empirical
Methods in Natural Language Processing. Associa-
tion for Computational Linguistics.
Schreiber, J., Bilmes, J., and Noble, W. S. (2020). apri-
cot: Submodular selection for data summarization
in python. Journal of Machine Learning Research,
21(161):1–6.
Song, K., Tan, X., Qin, T., Lu, J., and Liu, T.-Y. (2020). Mp-
net: Masked and permuted pre-training for language
understanding. arXiv preprint arXiv:2004.09297.
Spörlein, C. and Schlueter, E. (2021). Ethnic insults in YouTube comments: Social contagion and selection effects during the German “refugee crisis”. European Sociological Review, 37(3):411–428.
Gugger, S., Debut, L., Wolf, T., Schmid, P., Mueller, Z., Mangrulkar, S., Sun, M., and Bossan, B. (2022). Accelerate: Training and inference at scale made simple, efficient and adaptable. https://github.com/huggingface/accelerate.
Törnberg, A. and Törnberg, P. (2016). Combining CDA and topic modeling: Analyzing discursive connections between islamophobia and anti-feminism on an online forum. Discourse & Society, 27(4):401–422.
Törnberg, P. (2023). How to use LLMs for text analysis. arXiv preprint arXiv:2307.13106.
Tunstall, L., Reimers, N., Jo, U. E. S., Bates, L., Korat,
D., Wasserblat, M., and Pereg, O. (2022). Efficient
few-shot learning without prompts. arXiv preprint
arXiv:2209.11055.
Wei, J., Bosma, M., Zhao, V. Y., Guu, K., Yu, A. W., Lester,
B., Du, N., Dai, A. M., and Le, Q. V. (2021). Fine-
tuned language models are zero-shot learners. arXiv
preprint arXiv:2109.01652.
Zhang, S., Dong, L., Li, X., Zhang, S., Sun, X., Wang, S.,
Li, J., Hu, R., Zhang, T., Wu, F., et al. (2023). In-
struction tuning for large language models: A survey.
arXiv preprint arXiv:2308.10792.