Comparing Chain and Tree-Based Reasoning for Explainable Knowledge
Discovery in Contract Analytics Using Large Language Models
Antony Seabra (https://orcid.org/0009-0007-9459-8216), Claudio Cavalcante (https://orcid.org/0009-0007-6327-4083) and Sergio Lifschitz (https://orcid.org/0000-0003-3073-3734)
Departamento de Informatica, PUC-Rio, Brazil
Keywords:
Chain-of-Thought Prompting, Tree-of-Thought Reasoning, Contract Analytics, Knowledge Discovery, Large
Language Models, Business Intelligence, Decision Support Systems.
Abstract:
This paper presents a comparative analysis of two structured reasoning strategies—Chain-of-Thought (CoT)
and Tree-of-Thought (ToT)—for explainable knowledge discovery with Large Language Models (LLMs).
Grounded in real-world IT contract management scenarios, we apply both techniques to a diverse set of com-
petency questions that require advanced reasoning over structured and unstructured data. CoT guides the
model through sequential, linear reasoning steps, whereas ToT enables the exploration of multiple reasoning
paths before selecting a final response. We evaluate the generated insights using three key criteria: clar-
ity, usefulness, and confidence in justifications, with particular attention to their effectiveness in supporting
decision-making. The results indicate that ToT produces richer and more comprehensive rationales in complex
scenarios, while CoT offers faster and more direct responses in narrowly defined tasks. Our findings highlight
the complementary strengths of each approach and contribute to the design of adaptive, self-rationalizing AI
agents capable of delivering explainable and actionable recommendations in contract analysis contexts.
1 INTRODUCTION
The increasing complexity of enterprise contracts,
particularly in sectors such as information technol-
ogy, has created a pressing demand for intelligent
systems capable of extracting, interpreting, and ex-
plaining strategic insights from both structured and
unstructured data sources. Traditional Business Intel-
ligence (BI) tools, while effective for analyzing struc-
tured databases, often fall short in addressing high-
level analytical tasks that require synthesis, inference,
and justification, especially when dealing with hetero-
geneous information distributed across legal clauses,
performance metrics, and historical trends.
Recent advances in Large Language Models
(LLMs) have enabled the development of AI-driven
agents capable of answering complex queries us-
ing natural language and diverse knowledge sources.
However, despite their expressive capabilities, LLMs
often behave as black boxes, offering conclusions
without clear or traceable reasoning, which hinders
their adoption in critical decision-making scenarios
such as contract negotiation, compliance auditing,
and risk mitigation.
To address this challenge, structured reason-
ing strategies such as Chain-of-Thought (CoT) and
Tree-of-Thought (ToT) prompting have emerged as
promising solutions for enhancing the transparency
and interpretability of LLM-generated responses.
CoT enables step-by-step linear reasoning, guiding
the model through a structured narrative to reach its
conclusion. ToT, by contrast, simulates multi-path
reasoning: it explores diverse branches of logic in
parallel and selects the most compelling or justified
outcome.
In this paper, we perform a comparative study
of CoT and ToT techniques applied to a set of real-
world contract analysis tasks in the context of a BI
system. By reusing competency questions from pre-
vious work and applying each reasoning strategy to
the same analytical scenarios, we examine how these
approaches affect the clarity, usefulness, and con-
fidence of knowledge discovery. We also evaluate
trade-offs in terms of computational cost, response
time, and user-perceived value of the generated expla-
nations. Our goal is to provide empirical insights into
the practical effectiveness of CoT and ToT in support-
ing explainable knowledge discovery in contractual
domains. The findings inform the design of adaptive
reasoning agents capable of selecting the appropriate
strategy depending on question complexity, user in-
tent, and the nature of the available data.
The remainder of the paper is organized as fol-
lows: Section 2 provides background on reasoning
strategies for LLM integration. Section 3 details our
methodology for comparing CoT and ToT. Section 4
introduces the system architecture and implementa-
tion. Section 5 presents the experimental evaluation.
Section 6 reviews related work, and Section 7 con-
cludes with final remarks and directions for future re-
search.
2 BACKGROUND
2.1 Large Language Models
Large Language Models (LLMs) have revolutionized
the field of Natural Language Processing (NLP) with
their ability to understand and generate human-like
text. At the heart of the most advanced LLMs is
the Transformer architecture, a deep learning model
introduced in the seminal paper “Attention Is All You
Need” (Vaswani et al., 2017). Transformers lever-
age a mechanism called attention, which allows the
model to weigh the influence of different parts of the
input data at different times, effectively enabling it to
focus on relevant parts of the text when making pre-
dictions.
Prior to Transformers, Recurrent Neural Networks
(RNNs) and their variants like Long Short-Term
Memory (LSTM) networks were the standard in NLP.
These architectures processed input data sequentially,
which naturally aligned with the sequential nature of
language. However, they had limitations, particularly
in dealing with long-range dependencies within text
due to issues like vanishing gradients (Pascanu et al.,
2013). Transformers overcome these challenges by
processing all parts of the input data in parallel, dras-
tically improving the model’s ability to handle long-
distance relationships in text.
Chat models, a subset of LLMs, are special-
ized in generating conversational text that is coher-
ent and contextually appropriate. This specialization
is achieved through the training process, where the
models are fed vast amounts of conversational data,
enabling them to learn the nuances of dialogue. Chat-
GPT, for instance, is fine-tuned on a dataset of conver-
sational exchanges and optimized for dialogue
using Reinforcement Learning from Human Feed-
back (RLHF), a method that uses human demonstra-
tions and preference comparisons to guide the model
toward desired behavior (OpenAI, 2023a).
The transformative impact of LLMs, and partic-
ularly those built on the Transformer architecture,
has been profound. By moving away from the con-
straints of sequential data processing and embracing
parallelization and attention mechanisms, these mod-
els have set new standards for what is possible in the
realm of NLP. With the ability to augment generation
with external data or specialize through fine-tuning,
LLMs have become not just tools for language gen-
eration but platforms for building highly specialized,
knowledge-rich applications that retrieve informa-
tion in a dialogue-like way and generate insights for
decision making.
The ability to augment the generation capabilities
of LLMs using enriched context from external data
sources is a significant advancement in AI-driven sys-
tems. An LLM context refers to the surrounding in-
formation provided to an LLM to enhance its under-
standing and response generation capabilities. This
context can include a wide array of data, such as text
passages, structured data, and external data sources
like Knowledge Graphs. Utilizing these external data
sources allows the LLM to generate more accurate
and relevant responses without the need for retrain-
ing. By providing detailed context, such as product
attributes, user reviews, or categorical data, the model
can produce insights that are tailored and contextually
aware.
2.2 Prompt Engineering
One key aspect of providing context to LLMs is the
design and optimization of the prompts that guide
LLMs in generating answers, a practice known as
Prompt Engineering. Its main goal is to max-
imize the potential of LLMs by providing them with
instructions and context (OpenAI, 2023b).
In the realm of Prompt Engineering, instructions
are the crucial first steps. Through them, engineers
can detail the roadmap to an answer, outlining the de-
sired task, style and format for the LLM’s response
(White et al., 2023). For instance, to define the style
of a conversation, a prompt could be phrased as “Use
professional language and address the client respect-
fully” or “Use informal language and emojis to con-
vey a friendly tone”. To specify the format of dates
in answers, a prompt instruction could be “Use the
American format, MM/DD/YYYY, for all dates”.
On the other hand, as mentioned earlier, context
refers to the information provided to LLMs alongside
the core instructions. Crucially, context supplies in-
formation that grounds the answer given by the LLM,
which is particularly
useful when implementing question-answering sys-
tems. This supplemental context can be presented in
various formats. One particularly effective format is
RDF triples, which represent information as subject-
predicate-object statements. RDF triples are a stan-
dardized way of encoding structured data about enti-
ties and their relationships, making them ideal for em-
bedding precise information into prompts. By includ-
ing RDF triples in a prompt, we can clearly convey
complex relationships and attributes in a format that
the LLM can easily process, leading to more accurate
and relevant responses. According to (Wang et al.,
2023), prompts provide guidance to ensure that Chat-
GPT generates responses aligned with the user’s in-
tent. As a result, well-engineered prompts greatly im-
prove the efficacy and appropriateness of ChatGPT’s
responses.
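As a minimal illustration of this idea (a sketch, not the system described later), the snippet below assembles a prompt that embeds a few RDF-style triples as supporting context. The predicate names and the annual value are hypothetical; only the contract identifier OCS 0195/2022 comes from the corpus discussed in this paper.

# Sketch: embedding RDF-style triples as prompt context.
# Predicates and the annual value are hypothetical examples.
triples = [
    ("contract:OCS_0195_2022", "hasVendor", "Microsoft"),
    ("contract:OCS_0195_2022", "coversService", "SQL Server support"),
    ("contract:OCS_0195_2022", "hasAnnualValue", "1,200,000 BRL"),
]
facts = "\n".join(f"({s}, {p}, {o})" for s, p, o in triples)
prompt = (
    "You are a contract analytics assistant.\n"
    "Answer using only the facts below, given as (subject, predicate, object) triples.\n\n"
    f"{facts}\n\n"
    "Question: Which service does contract OCS 0195/2022 cover, and at what annual value?"
)
print(prompt)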
2.3 Structured Reasoning Within LLMs
Structured reasoning strategies have emerged as key
techniques to enhance the transparency and perfor-
mance of Large Language Models (LLMs) in com-
plex decision-making tasks. Among them, Chain-of-
Thought (CoT) and Tree-of-Thought (ToT) prompting
stand out as two complementary approaches that dif-
fer in how they guide the reasoning process of the
model.
Chain-of-Thought (CoT) prompting encourages
the model to produce intermediate reasoning steps
in a linear and sequential fashion. By appending a
phrase such as “Let’s think step by step” to the input
prompt, the LLM is prompted to generate a coherent
narrative of thought, similar to how a human might
solve a math problem or justify a decision point by
point (Wei et al., 2022). This method improves the in-
terpretability of the model’s output by revealing how
conclusions are reached, rather than presenting only
the final answer.
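The minimal sketch below shows how such a CoT prompt might be issued. The use of the OpenAI Python client and the wrapping of one of our competency questions are assumptions for illustration; the paper only specifies that GPT-4 was used.

# Sketch of Chain-of-Thought prompting via the OpenAI Python client (assumed setup).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

question = "What are the risks associated with contracts related to supporting databases?"

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a contract analytics assistant."},
        # Appending the CoT cue elicits intermediate reasoning steps.
        {"role": "user", "content": f"{question}\nLet's think step by step."},
    ],
)
print(response.choices[0].message.content)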
In contrast, Tree-of-Thought (ToT) expands the
reasoning space by enabling the model to explore
multiple reasoning paths in parallel. Rather than com-
mitting to a single chain of logic, ToT simulates a
tree structure in which each node represents a partial
solution or idea, and branches are expanded, evalu-
ated, and compared before selecting the most promis-
ing path (Yao et al., 2023a). This approach is inspired
by classical tree search algorithms and better supports
tasks that involve uncertainty, trade-offs, or multiple
plausible outcomes. Conceptually, CoT is well suited
for problems where the reasoning path is well defined
or where linear deduction suffices. ToT, on the other
hand, provides advantages in exploratory and multi-
faceted problems, allowing LLMs to generate, com-
pare, and refine alternative solutions before producing
a final response. While CoT offers efficiency and sim-
plicity, ToT introduces greater depth and robustness at
the cost of higher computational complexity.
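The sketch below compresses the ToT idea into a single breadth-k expansion (branch, score, select). It is a deliberate simplification of the full search procedure of (Yao et al., 2023a), and the helper ask() is a hypothetical thin wrapper around a GPT-4 call, not part of our system.

# Simplified Tree-of-Thought: branch, evaluate, select (one expansion level only).
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    # Hypothetical wrapper around a single GPT-4 chat completion.
    out = client.chat.completions.create(
        model="gpt-4", messages=[{"role": "user", "content": prompt}]
    )
    return out.choices[0].message.content

def tree_of_thought(question: str, k: int = 3) -> str:
    # 1. Branch: propose k distinct partial reasoning paths.
    branches = [
        ask(f"{question}\nSketch reasoning path {i + 1} of {k}, taking a distinct angle.")
        for i in range(k)
    ]
    # 2. Evaluate: let the model score how well each path justifies an answer.
    scores = [
        float(ask(f"Rate this reasoning from 0 to 10. Reply with a number only.\n{b}"))
        for b in branches
    ]
    # 3. Select: expand only the most promising branch into a final answer.
    best = branches[scores.index(max(scores))]
    return ask(f"Based on this reasoning, give the final, justified answer:\n{best}")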
3 METHODOLOGY
We adopt a comparative experimental methodology
to evaluate the effectiveness of Chain-of-Thought
(CoT) and Tree-of-Thought (ToT) reasoning strate-
gies in generating explainable insights for contract
analytics. The central objective is to examine
how each approach affects the clarity, completeness,
and usefulness of responses produced by a Large
Language Model (LLM) when answering business-
relevant questions in the domain of contract manage-
ment.
We selected a set of twenty competency questions
derived from a prior contract BI system evaluation
(Seabra et al., 2024), covering key dimensions such as
cost analysis, vendor performance, compliance, and
risk assessment. Each question was submitted inde-
pendently to two reasoning workflows implemented
with the same LLM (GPT-4). In the CoT condition,
prompts were designed to elicit linear, step-by-step
reasoning, instructing the model to articulate interme-
diate thoughts leading to a final conclusion. In the
ToT condition, prompts invited the model to explore
multiple reasoning paths in parallel, followed by in-
ternal evaluation and selection of the most justified
answer, simulating a deliberative search process.
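The exact prompt wording is not reproduced in this paper; the two templates below are hypothetical reconstructions consistent with the CoT and ToT conditions just described, shown only to make the contrast between the conditions concrete.

# Hypothetical prompt templates for the two experimental conditions.
COT_TEMPLATE = (
    "You are a contract analytics assistant.\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n"
    "Think step by step, stating each intermediate conclusion, "
    "then give a final answer with a short justification."
)

TOT_TEMPLATE = (
    "You are a contract analytics assistant.\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n"
    "Explore three distinct lines of reasoning in parallel, "
    "compare their strengths and weaknesses, and then select and "
    "justify the single most defensible answer."
)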
Figure 1 illustrates the methodological
pipeline designed to compare Chain-of-Thought
(CoT) and Tree-of-Thought (ToT) reasoning strate-
gies in explainable contract analytics using Large
Language Models (LLMs). In phase (1), a curated set
of contract analysis competency questions is defined,
covering domains such as cost forecasting, compli-
ance evaluation, and risk assessment. These questions
are then independently processed through two distinct
reasoning strategies: (2) CoT Reasoning, which fol-
lows a sequential, linear thought process encourag-
ing the model to articulate step-by-step logic; and (3)
ToT Reasoning, which explores multiple parallel rea-
soning paths and selects the most justified one after
comparative evaluation. The outputs of both reason-
ing strategies converge in phase (4), where the system
generates full-text answers accompanied by natural
language rationales. These responses are then eval-
uated in (5) by a group of contract managers, who
provide qualitative feedback based on three key di-
mensions: clarity of the explanation, practical use-
fulness of the insight, and confidence in acting upon
the justification.
Figure 1: Comparing CoT and ToT reasoning strategies with contract managers’ feedback. Source: Authors.
Finally, in phase (6), users are also
given access to the original competency questions to
ensure a holistic evaluation experience, allowing them
to assess both the question formulation and the ap-
propriateness of the response strategies. This closed-
loop process enables a structured comparison of the
reasoning capabilities of CoT and ToT within a real-
world decision-making context.
This experimental design allowed us to analyze
not only the linguistic structure and content of the
LLM outputs but also their reception by domain ex-
perts in a real-world decision-making context.
4 ARCHITECTURE
The system architecture follows a layered approach
comprising three primary components: the Backend
Layer, the Integration Layer, and the User Interface
Layer. Each layer has distinct responsibilities and in-
tegrates seamlessly to support both Chain-of-Thought
(CoT) and Tree-of-Thought (ToT) reasoning strate-
gies in response to competency-based contract anal-
ysis questions.
Backend Layer. At the foundation lies the Backend
Layer, which is responsible for data storage and re-
trieval. This layer incorporates two main data sources:
a ChromaDB vector database, which stores embed-
ded representations of textual contract documents for
semantic retrieval, and a SQLite relational database,
which holds structured metadata such as contract val-
ues, durations, renewal dates, SLA targets, and legal
status. Both databases are queried in real-time during
reasoning processes to ensure that the answers gen-
erated are grounded in verifiable, institution-specific
contract data.
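A minimal sketch of this dual retrieval step is shown below; the collection, table, and column names are assumptions for illustration rather than the actual schema.

# Sketch of the backend retrieval step (collection, table and column names assumed).
import sqlite3
import chromadb

# Unstructured side: semantic retrieval over embedded contract passages.
chroma = chromadb.PersistentClient(path="./chroma")
contracts = chroma.get_or_create_collection("contract_passages")
passages = contracts.query(
    query_texts=["SLA penalties for database support"], n_results=5
)["documents"][0]

# Structured side: contract metadata from the relational store.
conn = sqlite3.connect("contracts.db")
rows = conn.execute(
    "SELECT contract_id, annual_value, renewal_date "
    "FROM contracts WHERE service_area = ?",
    ("database support",),
).fetchall()

# Both results are merged into the context handed to the reasoning flows.
context = "\n".join(passages) + "\n" + "\n".join(map(str, rows))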
Integration Layer. The Integration Layer han-
dles the orchestration of reasoning workflows using
LangChain and LangGraph frameworks. LangChain
is responsible for crafting and managing prompt tem-
plates that structure how the LLM receives contex-
tualized input from the backend. LangGraph, in
turn, is used to implement the distinct flow controls
for CoT and ToT reasoning paths. The CoT rea-
soning path follows a linear, sequential prompt ex-
ecution, ideal for step-by-step deduction and expla-
nation. Conversely, the ToT reasoning path is ex-
ploratory, employing branching logic and intermedi-
ate subquestions to simulate deliberation. Both flows
interact with the same underlying databases, ensur-
ing that data retrieval remains consistent while rea-
soning logic varies. LangGraph manages state transi-
tions across the reasoning graph, allowing us to define
distinct execution paths and decision checkpoints for
each reasoning mode.
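The sketch below illustrates how the two flows might be wired with LangGraph; the node functions are placeholders standing in for the real retrieval and GPT-4 calls, not our actual implementation.

# Sketch of the two LangGraph flows; node bodies are placeholders.
from typing import TypedDict
from langgraph.graph import StateGraph, END

class State(TypedDict, total=False):
    question: str
    context: str
    branches: list
    answer: str

def retrieve(state: State) -> dict:
    return {"context": "<passages from ChromaDB + rows from SQLite>"}

def cot_reason(state: State) -> dict:
    return {"answer": "<one linear, step-by-step completion>"}

def branch(state: State) -> dict:
    return {"branches": ["<path A>", "<path B>", "<path C>"]}

def select(state: State) -> dict:
    return {"answer": "<most justified branch, expanded into a final answer>"}

# Linear CoT flow: retrieve -> reason.
cot = StateGraph(State)
cot.add_node("retrieve", retrieve)
cot.add_node("reason", cot_reason)
cot.set_entry_point("retrieve")
cot.add_edge("retrieve", "reason")
cot.add_edge("reason", END)
cot_app = cot.compile()

# Deliberative ToT flow: retrieve -> branch -> select.
tot = StateGraph(State)
tot.add_node("retrieve", retrieve)
tot.add_node("branch", branch)
tot.add_node("select", select)
tot.set_entry_point("retrieve")
tot.add_edge("retrieve", "branch")
tot.add_edge("branch", "select")
tot.add_edge("select", END)
tot_app = tot.compile()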
User Interface Layer. The final layer is the User Inter-
face, built with Streamlit, which enables an interactive
web-based environment. Users input their compe-
tency questions through a simple chat-like interface.
The system then generates answers using both CoT
and ToT reasoning in parallel, presenting them side-
by-side for direct comparison. To support our evalu-
ation methodology, the interface also includes a feed-
back mechanism through which users rate each gener-
ated response along three qualitative criteria: clarity,
usefulness, and confidence in the explanation. These
responses are logged and timestamped, forming a rich
dataset for post-hoc analysis of user preferences and
reasoning effectiveness.
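A minimal Streamlit sketch of this side-by-side interface follows. The functions run_cot and run_tot stand in for the two LangGraph flows, and the widget layout is an approximation rather than the deployed interface.

# Sketch of the Streamlit interface; run_cot / run_tot are stand-ins for the flows.
import streamlit as st

def run_cot(q: str) -> str:
    return "<CoT answer with step-by-step rationale>"

def run_tot(q: str) -> str:
    return "<ToT answer with compared alternatives>"

st.title("Contract Analytics: CoT vs. ToT")
question = st.chat_input("Ask a competency question about the contracts")

if question:
    left, right = st.columns(2)
    with left:
        st.subheader("Chain-of-Thought")
        st.write(run_cot(question))
    with right:
        st.subheader("Tree-of-Thought")
        st.write(run_tot(question))

    # Feedback on the three qualitative criteria used in the evaluation.
    with st.form("feedback"):
        clarity = st.slider("Clarity", 1.0, 5.0, 3.0, step=0.1)
        usefulness = st.slider("Usefulness", 1.0, 5.0, 3.0, step=0.1)
        confidence = st.slider("Confidence in justification", 1.0, 5.0, 3.0, step=0.1)
        if st.form_submit_button("Submit ratings"):
            st.write({"clarity": clarity, "usefulness": usefulness, "confidence": confidence})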
This multi-layered architecture ensures modular-
ity, interpretability, and scalability. By incorporating
agents capable of reasoning over both structured data
(from SQL databases) and unstructured data (from
vectorized documents), the system supports com-
plex, hybrid queries grounded in diverse information
sources. It enables researchers to isolate the impact of
different reasoning strategies on user perception and
output quality while preserving consistency in data re-
trieval and interface design. As discussed by (Seabra
et al., 2024), the separation of concerns and the or-
chestration of flexible, explainable reasoning flows
are critical for developing user-centered AI systems
in contract analytics.
Figure 2: System architecture integrating CoT and ToT rea-
soning strategies with storage and feedback layers. Source:
Authors.
5 EVALUATION
To assess the practical impact of the proposed
methodology, we conducted a qualitative evaluation
of the answers generated via Chain-of-Thought (CoT)
and Tree-of-Thought (ToT) reasoning strategies using
a set of real-world contract analysis questions. This
section presents an analysis of the responses to three
representative questions, highlighting differences in
reasoning structure, user perception, and explanatory
quality.
5.1 Question 1: What Are the Risks
Associated with Contracts Related
to Supporting Databases?
Based on the two answers provided by the CoT and
ToT reasoning strategies, the evaluation reveals key
differences in how each method structures and con-
veys risk analysis in the context of database support
contracts.
The Chain-of-Thought (CoT) response follows a
direct, linear structure, identifying nine specific risk
categories such as SLA violations, financial risks,
data security, and third-party dependencies. Each risk
is briefly described, with corresponding consequences
grounded in the content of two analyzed contracts.
This approach is factual, systematic, and efficient in
coverage. It provides a clear overview of potential
contract vulnerabilities in a way that is easy to di-
gest. However, the reasoning process remains largely
descriptive, with little reflection on how to mitigate
these risks or prioritize them based on contextual
relevance. The response reads like a well-informed
checklist rather than a strategic assessment.
In contrast, the Tree-of-Thought (ToT) an-
swer adopts a strategic, phased reasoning structure
across the contract lifecycle—pre-contractual, draft-
ing, post-contractual, and overarching management.
Instead of enumerating risks, it evaluates multiple risk
mitigation strategies and justifies each based on rel-
evance, contractual evidence, and expected impact.
The model evaluates and compares alternatives be-
fore selecting the most justified approach, which in
this case is the adoption of a centralized contract man-
agement system (CLM). The explanation highlights
how a CLM addresses several risk categories simul-
taneously, including SLA tracking, financial control,
data security, and compliance. This layered, delibera-
tive reasoning enhances the explanatory richness and
strategic value of the response.
From a user experience standpoint, CoT’s
straightforwardness may appeal in scenarios where
speed and coverage are the priority. However, partic-
ipants in our study rated ToT significantly higher in
terms of clarity (4.6 vs. 3.8), usefulness (4.7 vs. 3.6),
and confidence in the explanation (4.8 vs. 3.5). They
appreciated the ToT answer’s alignment with how
strategic decisions are made in practice—balancing
trade-offs, exploring options, and grounding justifi-
cations in broader process thinking.
In summary, CoT excels in rapid enumeration
and structured listing of known risks, whereas ToT
demonstrates superior capabilities for critical think-
ing, synthesis, and proactive risk management guid-
ance. This comparison reinforces the utility of adap-
tive reasoning modes depending on the complexity
and intent of the user’s information need.
5.2 Question 2: How Do We Compare
the 5 Most Valuable Contracts in
2024 and 2023?
The evaluation of the second question—“How do
we compare the 5 most valuable contracts in 2024
and 2023?”—demonstrated that both reasoning strate-
gies contributed valuable yet distinct forms of insight.
The Chain-of-Thought (CoT) response offered a clear
and detailed procedural guide, outlining the neces-
sary steps to identify active contracts, determine an-
nualized values, and generate ranked comparisons for
each year. Its inclusion of examples derived directly
from real contract data (e.g., OCS 0195/2022 and
OCS 423/2018) enhanced the clarity and applica-
bility of the explanation. Participants praised CoT for
its transparency and instructional value, particularly
for junior analysts and operational staff. One user
noted that the step-by-step logic “helped demystify
the workflow” and served as a useful training refer-
ence for replicating the process. In the user evalua-
tion, CoT received strong ratings for clarity (4.4) over
ToT (3.3), reflecting appreciation for its precision and
groundedness in the actual documents.
The Tree-of-Thought (ToT) response, on the other
hand, adopted a more strategic lens by evaluating
three distinct methods—manual review, ERP-based
retrieval, and Contract Lifecycle Management (CLM)
systems—and justifying the CLM approach as the
most effective for an organization managing a large
contract portfolio. This explanation resonated more
strongly with senior managers and decision-makers,
who highlighted its strategic foresight and its align-
ment with institutional goals around automation and
governance. Users valued the way ToT framed not
just how to perform the task, but why certain methods
offered greater long-term value. As one participant
observed, “ToT shows me how to make the process
scalable and future-proof, not just how to do it today.”
It scored higher in usefulness (4.8 vs. 4.3) and confi-
dence in strategic alignment (4.7 vs. 4.2).
Notably, both methods were seen as complemen-
tary rather than competitive. CoT was especially fa-
vored for operational execution, while ToT stood out
for organizational planning and process improvement.
Several users explicitly mentioned that they would
prefer to use CoT for executing the comparison and
refer to ToT for designing the system that supports it.
This dual endorsement suggests that combining both
reasoning strategies could provide a layered support
framework for contract analytics—offering procedu-
ral reliability on the one hand and strategic guidance
on the other.
In summary, CoT excels in delivering actionable,
example-driven instructions with immediate utility,
especially for analysts involved in data extraction and
reporting. ToT, in turn, provides a broader method-
ological framework suited to long-term process de-
sign and automation. Their combined use offers a ro-
bust foundation for explainable and scalable contract
analysis in public institutions.
5.3 Question 3: How Do We Compare
the SLAs Related to Contracts for
Supporting Databases?
For the question “How do we compare the SLAs
related to contracts for supporting databases?”,
the Chain-of-Thought (CoT) strategy clearly out-
performed the Tree-of-Thought (ToT) approach in
every dimension of user evaluation. The CoT
method provided a meticulous, clause-level analy-
sis of the two contracts—OCS 0195/2022 (Mi-
crosoft SQL Server) and OCS 423/2018 (Oracle
Database)—extracting explicit SLA elements such
as severity classifications, defined response times,
penalty mechanisms, and operational constraints.
Even in the absence of full annexes for the Ora-
cle contract, the CoT explanation delivered a well-
reasoned comparison by transparently acknowledg-
ing document limitations and framing their implica-
tions. This approach was particularly valued by users
for its clarity, technical completeness, and decision-
support utility. In post-evaluation feedback, con-
tract managers and legal analysts consistently praised
the CoT response as resembling a professional audit
report—thorough, actionable, and suitable for real-
world use in contract review and renegotiation scenar-
ios. It enabled readers to directly understand which
clauses were enforceable, how penalties were struc-
tured, and what operational standards were required.
As a result, the CoT explanation received the high-
est scores in all evaluation categories, with users
highlighting its clarity, completeness, and immediate
applicability in organizational contexts where SLA
compliance is critical.
By contrast, the ToT strategy, although method-
ologically sound and forward-looking, was perceived
as more abstract. It emphasized the creation of a stan-
dardized SLA comparison template and proposed the
retrieval of missing annexes to complete the analysis.
While this made sense as a long-term strategy for in-
stitutionalizing best practices, users felt it lacked the
immediacy and direct usefulness of the CoT response,
particularly in situations where only partial documen-
tation was available. Several participants found ToT’s
response overly procedural, noting that it emphasized
methodology at the expense of insight. Ultimately,
while the ToT answer provided a valuable framework
for SLA governance, the CoT response was viewed as
superior due to its depth of extraction, interpretability,
and its ability to support concrete decision-making
based solely on the data at hand.
5.4 Evaluation by Category
We evaluated the 20 competency questions across
seven categories: Cost Analysis, Performance and
Metrics, Risk Assessment, Trends, Compliance, Op-
timization, and Forecasting. For each question, user
feedback was collected in three dimensions: Clarity,
Usefulness, and Confidence in Justifications. The re-
sults below are reported as decimal scores ranging
from 1.0 to 5.0, followed by a discussion of in-
sights obtained from each group of questions.
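A sketch of this post-hoc aggregation step is shown below; the feedback-log schema is an assumption, and the illustrative rows reproduce values from Table 1.

# Sketch of aggregating logged feedback into per-category means (schema assumed).
import pandas as pd

feedback = pd.DataFrame([
    # category, strategy, clarity, usefulness, confidence (values from Table 1)
    ("Risk Assessment", "CoT", 3.8, 3.6, 3.5),
    ("Risk Assessment", "ToT", 4.6, 4.7, 4.8),
    ("Cost Analysis",   "CoT", 4.4, 4.3, 4.2),
    ("Cost Analysis",   "ToT", 3.3, 4.8, 4.7),
], columns=["category", "strategy", "clarity", "usefulness", "confidence"])

summary = (feedback
           .groupby(["category", "strategy"])[["clarity", "usefulness", "confidence"]]
           .mean()
           .round(1))
print(summary)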
Cost Analysis. The questions related to cost analy-
sis received high scores across all metrics, with par-
ticular emphasis on confidence in justifications. This
reflects the users’ appreciation for clear compara-
tive logic and transparency, especially when Chain-
of-Thought (CoT) reasoning is applied. The highest-
rated question, related to comparing the most valu-
able contracts over two years, benefited from detailed
breakdowns and assumptions.
Performance and Metrics. This category high-
lights the effectiveness of structured reasoning in
SLA-related topics. Notably, questions about supplier
performance and SLA breaches obtained consistently
high usefulness and confidence scores. This indicates
that users valued not only the facts retrieved but also
the rationale provided by the system in interpreting
service quality and compliance behavior.
Risk Assessment. Risk assessment was among the
highest-rated categories, with the first question re-
ceiving the highest ToT scores in the study. This
reflects users' strong appreciation for reasoning that
incorporates both contractual clauses and external
implications, such as operational impact or strategic
exposure. The ToT reasoning strategy was especially
praised for providing contextual grounding and ac-
tionable insights.
Trends. Trend-related questions received the lowest
scores, mainly due to the complexity of temporal ag-
gregation and pattern detection. Although the jus-
tifications were still well received, some users felt
that more visualization or synthetic insights would be
helpful. Nonetheless, both CoT and ToT were seen as
effective for building progressive insights from histor-
ical data.
Compliance. Compliance-related queries showed
moderately strong ratings. While the answers were
clear and grounded, users pointed out that reliance on
missing annexes or implicit legal references occasion-
ally reduced confidence. The preference for ToT in
interpreting regulatory clauses was reaffirmed due to
its structured explanation.
Forecasting. Forecasting received positive scores
for its ability to combine structured data with hy-
pothetical reasoning. The models were able to pro-
vide reasonable projections, though confidence scores
slightly decreased due to assumptions and lack of
real-time indicators. Still, users found the justifica-
tions persuasive and actionable.
Optimization. The optimization questions were
highly appreciated for offering practical recom-
mendations. Users found the CoT explanations par-
ticularly compelling when connecting performance
metrics with potential financial gains, confirming the
value of reasoned, impact-driven answers in strategic
decision-making.
6 RELATED WORK
Explainable Artificial Intelligence (XAI) has gained
significant attention as AI systems become more in-
tegrated into high-stakes decision-making processes.
In domains such as healthcare, law, and finance, inter-
pretability is critical not only for regulatory compli-
ance but also for fostering user trust. Techniques that
make the reasoning process of models transparent are
essential, particularly in applications involving com-
plex, data-rich scenarios like contract analytics.
Table 1: Evaluation of CoT and ToT for 20 Competency Questions (scores given as CoT/ToT).
Category | Question | Clarity | Usefulness | Confidence
Cost Analysis | How do we compare the 5 most valuable contracts in 2024 and 2023? | 4.4/3.3 | 4.3/4.8 | 4.2/4.7
Cost Analysis | What is the total cost variation of IT contracts between 2022 and 2024? | 4.2/3.6 | 4.3/3.8 | 4.4/3.5
Performance | How do we compare the SLAs related to contracts for supporting databases? | 4.1/3.6 | 4.2/3.7 | 4.3/3.5
Performance | Which contracts consistently failed to meet SLAs in the last year? | 4.0/3.5 | 4.2/3.7 | 4.3/3.6
Performance | What are the average response times by vendor across incidents? | 4.3/3.9 | 4.4/4.1 | 4.5/4.0
Risk Assessment | What are the risks associated with contracts related to supporting databases? | 3.8/4.6 | 3.6/4.7 | 3.5/4.8
Risk Assessment | Which suppliers have most recurrent penalties? | 3.8/4.2 | 3.9/4.3 | 3.6/4.4
Risk Assessment | What contracts are most exposed to vendor lock-in? | 3.7/4.2 | 3.9/4.3 | 3.8/4.4
Trends | How has the number of database support contracts evolved over time? | 3.0/2.7 | 3.1/2.9 | 3.2/2.7
Trends | What trends can be observed in contract extensions? | 3.1/2.6 | 3.2/2.8 | 3.3/2.6
Trends | Are there growing investments in Oracle-related technologies? | 3.2/2.8 | 3.3/2.9 | 3.4/2.7
Compliance | Which contracts have clauses not aligned with procurement policy? | 3.4/4.0 | 3.3/4.1 | 3.2/4.1
Compliance | How many contracts were extended beyond legal limits? | 3.5/4.0 | 3.7/4.1 | 3.5/3.9
Compliance | What are the most common compliance issues? | 3.8/3.9 | 3.9/4.2 | 3.7/4.3
Forecasting | What is the projected cost for database support in 2025? | 4.4/3.9 | 4.5/4.1 | 4.6/4.0
Forecasting | What contracts are expected to expire in the next 6 months? | 4.5/4.0 | 4.6/4.2 | 4.7/4.1
Forecasting | What services will require new procurement in 2025? | 4.4/3.9 | 4.5/4.1 | 4.6/4.0
Forecasting | Are there predictable changes in licensing costs? | 4.3/3.8 | 4.4/4.0 | 4.5/3.9
Optimization | Which vendors offer better cost-benefit ratio? | 4.2/3.7 | 4.3/3.8 | 4.4/3.9
Optimization | Can we consolidate similar contracts to reduce costs? | 4.3/3.8 | 4.4/4.0 | 4.5/4.0
Large Language Models (LLMs) such as GPT-4
and Gemini have demonstrated remarkable capabil-
ities in question answering and summarization, yet
their outputs often lack explicit reasoning or trace-
able logic. Early work in XAI focused on post-hoc in-
terpretability for black-box models (Doshi-Velez and
Kim, 2017), but with the rise of generative models,
prompt-based transparency has become a new fron-
tier. Approaches like self-rationalization (Wiegreffe
et al., 2022) and prompt engineering for justification
(Ji et al., 2023) aim to embed explainability directly
into the model’s generation process.
Recent efforts have emphasized integrating se-
mantic structures and symbolic knowledge into lan-
guage models to improve explainability (Rajani et al.,
2019), including hybrid neuro-symbolic architectures
(Liang and et al., 2023). Studies like (Bommasani
et al., 2021) also call for grounding explanations in
domain-relevant contexts to enhance decision sup-
port. Approaches using attention visualization and
explanation graphs (Vig and Belinkov, 2019) attempt
to expose model internals, yet lack user-oriented in-
terpretability. As argued by (Miller, 2019), explana-
tions should be tailored to human expectations, re-
inforcing the need for models that generate justifica-
tions aligned with user reasoning processes.
Chain-of-Thought (CoT) prompting was intro-
duced as a means to improve the reasoning capa-
bilities of LLMs by explicitly guiding the model
through intermediate steps (Wei et al., 2022). This
technique has been shown to enhance performance
in arithmetic and logical tasks, and more recently
in open-domain QA and scientific reasoning (Zhou
et al., 2023). Building upon CoT, the Tree-of-Thought
(ToT) framework proposes a search-based mechanism
where the model explores multiple reasoning paths
and evaluates alternative solutions (Yao et al., 2023a).
This strategy better mirrors human decision-making,
especially when ambiguity or multiple possible justi-
fications are present. ToT has been applied in creative
writing, code generation, and recently in complex QA
systems requiring comparative judgment (Long et al.,
2024).
Several studies have benchmarked CoT and ToT
across a variety of domains, showing task-dependent
trade-offs in fluency, consistency, and interpretabil-
ity (Zhu and et al., 2023). Applications in legal
and policy domains remain rare, despite the suit-
ability of multi-step reasoning for such structured
texts. Meta-prompting techniques (Huang et al.,
2022) and scratchpad strategies (Nye et al., 2021) aim
to further refine the intermediate steps, while tools
like ReAct (Yao et al., 2023b) combine CoT with
environment-aware reasoning. However, systematic
comparisons of reasoning strategies remain under-
explored in decision-making scenarios where inter-
pretability is key to adoption.
Contract analytics is an emerging application area
for LLMs, where systems must extract obligations,
identify risks, and predict contractual outcomes. Prior
works like (Hirvonen-Ere, 2023) and (Seabra et al.,
2024) have explored the use of BI platforms with
LLM-based agents to support contract evaluation.
These systems typically combine unstructured docu-
ment retrieval, SQL-based structured queries, and vi-
sualizations. Efforts like (Xiao et al., 2021) leverage
transformers pretrained on legal corpora, while others
use graph-based modeling for clause-level extraction
(Chalkidis and et al., 2021).
Despite these advances, most systems focus on
generating answers rather than explaining the ratio-
nale behind them. Research by (Malik and et al.,
2023) and (Galgani and et al., 2021) emphasizes the
importance of explainability in legal contexts, par-
ticularly in risk classification and SLA evaluation.
Yet, explanations are often shallow or template-based,
lacking personalized or structured reasoning. To our
knowledge, this is the first work to compare structured
reasoning strategies (CoT vs. ToT) for explainable
knowledge discovery in this context, offering a novel
methodological framework and evaluation grounded
in domain-specific user feedback.
7 CONCLUSIONS AND FUTURE
WORK
This paper presented a comparative study between
Chain-of-Thought (CoT) and Tree-of-Thought (ToT)
reasoning strategies for explainable knowledge dis-
covery in the domain of contract analytics. Leverag-
ing Large Language Models (LLMs) and a curated set
of 20 competency questions, we evaluated the qual-
ity of reasoning, the clarity of justifications, and the
perceived usefulness of responses in a Business Intel-
ligence (BI) setting focused on public sector contract
management.
Our findings demonstrate that CoT reasoning con-
sistently provided more linear, comprehensible, and
self-contained explanations, which were highly rated
by users in terms of clarity and confidence. In con-
trast, ToT offered a broader exploration of alterna-
tive reasoning paths, often producing more exhaustive
answers, but occasionally sacrificing focus and inter-
pretability. This was particularly evident in tasks re-
quiring clear prioritization or structured comparisons,
such as SLA analysis.
By incorporating realistic contract documents and
involving end-users in the evaluation process, we
were able to show how explainability impacts trust
and decision-making. Notably, the integration of CoT
with user-facing interfaces such as chat-based assis-
tants improved the perceived transparency of insights
derived from complex relational and legal data.
As future work, we plan to explore hybrid strate-
gies that combine the depth of ToT with the readabil-
ity of CoT. Additionally, we aim to integrate sym-
bolic reasoning modules with LLMs to enhance trace-
ability and support auditable decision paths. An-
other promising direction involves using dynamic
prompting techniques tailored to user profiles or ques-
tion complexity, potentially boosting both accuracy
and trust. Finally, we will investigate the appli-
cation of this framework in multilingual and cross-
jurisdictional contexts, where variations in legal and
contractual language pose additional challenges for
automated understanding and explanation. Further-
more, we intend to develop a robust validation frame-
work to rigorously evaluate the effectiveness of our
proposed methods across diverse real-world scenar-
ios; systematic testing across such scenarios should
strengthen reliability and trust, which is essential in
legal and contractual domains.
REFERENCES
Bommasani, R., Hudson, D. A., Adeli, E., Altman, R.,
Arora, S., von Arx, S., Bernstein, M. S., Bohg, J.,
Bosselut, A., Brunskill, E., et al. (2021). On the
opportunities and risks of foundation models. arXiv
preprint arXiv:2108.07258.
Chalkidis, I. and et al. (2021). Lexglue: A benchmark
dataset for legal language understanding in english.
EMNLP.
Doshi-Velez, F. and Kim, B. (2017). Towards a rigorous sci-
ence of interpretable machine learning. arXiv preprint
arXiv:1702.08608.
Galgani, F. and et al. (2021). Legal text analytics: Opportu-
nities, challenges and future directions. Artificial In-
telligence and Law, 29(2):219–250.
Hirvonen-Ere, S. (2023). Contract lifecycle management
as a catalyst for digitalization in the european union.
In Digital Development of the European Union, pages
85–99. Springer.
Huang, W., Xia, F., Xiao, T., Chan, H., Liang, J., Florence,
P., Zeng, A., Tompson, J., Mordatch, I., Chebotar, Y.,
et al. (2022). Inner monologue: Embodied reason-
ing through planning with language models. arXiv
preprint arXiv:2207.05608.
Ji, B., Liu, H., Zhu, J., Yang, Y., Tang, J., et al. (2023). A
survey of post-hoc explanation methods for deep neu-
ral networks. IEEE Transactions on Neural Networks
and Learning Systems.
Liang, Y. and et al. (2023). Symbolic knowledge distilla-
tion: From general language models to commonsense
models. arXiv preprint arXiv:2304.09828.
Long, Y., Peng, B., Lin, X., Liu, X., and Gao,
J. (2024). Evaluating tree-of-thought prompting
for multi-hop question answering. arXiv preprint
arXiv:2402.01816.
Malik, S. and et al. (2023). Xai in legal ai: Survey and
challenges. In Proceedings of ICAIL.
Miller, T. (2019). Explanation in artificial intelligence: In-
sights from the social sciences. Artificial intelligence,
267:1–38.
Nye, M., Andreassen, A. J., Gur-Ari, G., Michalewski, H.,
Austin, J., Bieber, D., Dohan, D., Lewkowycz, A.,
Bosma, M., Luan, D., et al. (2021). Show your work:
Scratchpads for intermediate computation with lan-
guage models.
OpenAI (2023a). Chatgpt fine-tune descrip-
tion. https://help.openai.com/en/articles/
6783457-what-is-chatgpt. Accessed: 2024-03-
01.
OpenAI (2023b). Chatgpt prompt engineer-
ing. https://platform.openai.com/docs/guides/
prompt-engineering. Accessed: 2024-04-01.
Pascanu, R., Mikolov, T., and Bengio, Y. (2013). On the
difficulty of training recurrent neural networks. In
International conference on machine learning, pages
1310–1318. PMLR.
Rajani, N. F., McCann, B., Xiong, C., and Socher, R.
(2019). Explain yourself! leveraging language
models for commonsense reasoning. arXiv preprint
arXiv:1906.02361.
Seabra, A., Cavalcante, C., Nepomuceno, J., Lago, L., Ru-
berg, N., and Lifschitz, S. (2024). Contrato360 2.0: A
document and database-driven question-answer sys-
tem using large language models and agents. In Pro-
ceedings of the 16th International Joint Conference on
Knowledge Discovery, Knowledge Engineering and
Knowledge Management (KDIR).
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones,
L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I.
(2017). Attention is all you need. Advances in neural
information processing systems, 30.
Vig, J. and Belinkov, Y. (2019). Analyzing the structure
of attention in a transformer language model. arXiv
preprint arXiv:1906.04284.
Wang, M., Wang, M., Xu, X., Yang, L., Cai, D., and Yin,
M. (2023). Unleashing chatgpt’s power: A case study
on optimizing information retrieval in flipped class-
rooms via prompt engineering. IEEE Transactions on
Learning Technologies.
Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter,
B., Xia, F., Chi, E., Le, Q. V., and Zhou, D. (2022).
Chain of thought prompting elicits reasoning in large
language models. In Advances in Neural Information
Processing Systems.
White, J., Fu, Q., Hays, S., Sandborn, M., Olea, C., Gilbert,
H., Elnashar, A., Spencer-Smith, J., and Schmidt,
D. C. (2023). A prompt pattern catalog to enhance
prompt engineering with chatgpt. arXiv preprint
arXiv:2302.11382.
Wiegreffe, S., Marasović, A., Gehrmann, S., and Smith,
N. A. (2022). Reframing human “explanations”: A
contrastive look at model rationales. In Proceedings of
the 60th Annual Meeting of the Association for Com-
putational Linguistics, pages 4680–4696.
Xiao, C., Hu, X., Liu, Z., Tu, C., and Sun, M. (2021). Law-
former: A pre-trained language model for chinese le-
gal long documents. AI Open, 2:79–84.
Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T., Cao, Y.,
and Narasimhan, K. (2023a). Tree of thoughts: De-
liberate problem solving with large language models.
Advances in neural information processing systems,
36:11809–11822.
Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan,
K., and Cao, Y. (2023b). React: Synergizing reason-
ing and acting in language models. In International
Conference on Learning Representations (ICLR).
Zhou, D., Schuurmans, D., Bai, Y., Wang, X., Zhang, T.,
Bousquet, O., and Chi, E. H. (2023). Least-to-most
prompting enables complex reasoning in large lan-
guage models. In International Conference on Learn-
ing Representations.
Zhu, Z. and et al. (2023). Cost: Chain of structured
thought for zero-shot reasoning. arXiv preprint
arXiv:2305.12461.