CGNTM: Unsupervised Causal Topic Modeling with LLMs and
Nonlinear Causal GNNs
Peixuan Men (https://orcid.org/0009-0002-2630-3838), Longchao Wang (https://orcid.org/0009-0009-1387-3517),
Aihua Li (https://orcid.org/0000-0001-6742-3268) and Xiaoli Tang (https://orcid.org/0000-0001-6946-3482)
Institute of Medical Information/Medical Library, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
Keywords:
Causal Topic Modeling, Unsupervised Learning, Neural Causal Modeling, Graph Neural Networks.
Abstract:
We propose CGNTM, a fully unsupervised causal topic model that integrates large language models (LLMs)
with neural causal inference. Unlike conventional and supervised topic models, CGNTM learns both hierar-
chical topics and their directed causal relations directly from raw text, without requiring labeled data. The
framework leverages LLM-based prompt extraction to identify salient keywords and candidate causal pairs,
which are refined through differentiable Directed Acyclic Graph (DAG) learning and modeled via a nonlinear
structural causal model (SCM). A directionally masked graph neural network (GNN) propagates information
strictly along causal edges, while a Wasserstein Generative Adversarial Network (GAN) enforces semantic
consistency under counterfactual interventions via BERT-based regularization. This combination enables the
model to not only discover coherent and diverse topics but also uncover interpretable causal relationships
among them. The architecture supports hierarchical topic organization by clustering fine-grained terms into
broader themes and modeling cross-level dependencies through dual-layer message passing. Experimental re-
sults demonstrate that CGNTM outperforms state-of-the-art models in topic quality and causal interpretability.
Ablation studies confirm the essential role of each component (LLM-guided extraction, nonlinear SCM, directional GNN propagation, and adversarial training) in contributing to both causal accuracy and topic coherence.
The proposed framework opens new directions for unsupervised causal discovery in text, offering transforma-
tive potential in domains where understanding why certain topics co-occur is as crucial as identifying what
they are.
1 INTRODUCTION
Topic modeling is a vital tool in natural language
processing for uncovering hidden themes in large
text corpora. Classical models like Latent Dirich-
let Allocation (LDA) summarize documents into in-
terpretable topics, supporting tasks such as classifi-
cation and retrieval, but rely on bag-of-words, as-
sume independence, and ignore semantic dependen-
cies, limiting interpretability and omitting concept
relationships (Morstatter and Liu, 2018). Recent
Neural Topic Models (NTMs) leverage deep genera-
tive networks for flexible inference, enhancing coher-
ence through contextualized embeddings or external
knowledge (Shen et al., 2021). However, they capture
only statistical co-occurrence, not causal relationships
among topics, hindering interpretability and the abil-
ity to answer “why” questions from text data.
Recent efforts integrate causality, such as the
supervised Causal Relationship-Aware Neural Topic
Model (CRNTM) (Tang et al., 2024), which uses
Structural Causal Models (SCMs) to uncover topic-
label links in a Directed Acyclic Graph (DAG). This
improves structure and quality but requires supervi-
sion. Discovering causal relations in unlabeled cor-
pora, particularly with hierarchical organization, re-
mains an open challenge (Lagemann et al., 2023).
This paper addresses unsupervised causal topic
discovery: identifying hierarchical topics and infer-
ring a DAG of their causal relationships from raw
text without supervision. This tackles the intertwined
challenges of multi-granularity topic extraction and
causal graph inference using statistical patterns and
semantic knowledge for interpretable structures.
We propose the Causal Graph Neural Topic Model
(CGNTM), integrating LLMs, causal graph learning,
and a GNN in an unsupervised pipeline. It extracts
keywords and causal pairs via LLMs, refines them
through NOTEARS (Zheng et al., 2018) to enforce
DAG constraints, models nonlinear causality with a
neural SCM and directional GNN, and refines results
via a WGAN with BERT-based consistency for coun-
terfactual generation, enhancing topic coherence and
interpretability.
In summary, this paper introduces CGNTM as a
novel unsupervised framework for causal topic dis-
covery. Its key contributions are as follows:
• The first unsupervised topic modeling approach integrating LLMs to extract causal relationships and construct a causal topic graph from unlabeled corpora.
• A novel architecture combining a nonlinear neural SCM with a masked directional GNN for cause-to-effect propagation.
• Hierarchical causal structure via a two-layer GNN and clustering for multi-granular interpretation.
• Adversarial counterfactual training with WGAN and BERT consistency to improve coherence and prevent spurious links.
2 RELATED WORK
2.1 Neural and Hierarchical Topic
Modeling Approaches
Topic modeling has evolved from classical proba-
bilistic models to deep learning-based methods. Tra-
ditional approaches like Latent Dirichlet Allocation
(LDA) (Blei et al., 2003) and its hierarchical exten-
sions (Miao et al., 2016) infer topics as latent vari-
ables under assumptions such as word exchangeabil-
ity and topic independence, which limit expressive-
ness and coherence.
NTMs address these by incorporating variational
autoencoders or transformers for flexible representa-
tions. Key examples include ProdLDA (Srivastava
and Sutton, 2017), Embedded Topic Model (ETM)
(Dieng et al., 2020), and contextualized models
(Venugopalan and Gupta, 2022), which use learned
distributions and semantic embeddings to boost in-
terpretability. Recent advances integrate graphs or
knowledge, such as BERTopic (Grootendorst, 2022)
for embedding-based topics and BERT-Flow-VAE
(Liu et al., 2022) for calibrated embeddings in VAE
modeling.
However, these methods still treat topic relations
as statistical correlations or taxonomic hierarchies
rather than causal dependencies.
2.2 Causal Discovery and Structural
Causal Models in NLP
Causal discovery is gaining traction in NLP, espe-
cially topic modeling. SCMs formalize relations via functional equations of the form $X_i = f_i(PA_i, N_i)$, where $PA_i$ are the parents of $X_i$ in a DAG (Yang et al., 2022; Pawlowski et al., 2020), though unsupervised text applications are limited.
Works like CRNTM incorporate SCMs to link latent topics and labels, but require supervision.
CausalVAE adds a causal layer in VAEs for latent
DAGs without strong supervision (Panwar et al.,
2020; Yang et al., 2021), yet focuses on disentan-
gled factors, not textual topics (Wu et al., 2024). Few
explore fully unsupervised causal discovery in NLP
(Prostmaier et al., 2025), often relying on labels or
disentanglement rather than interpretable themes.
2.3 GNNs for Causal Inference and
Topic Modeling
GNNs model relational patterns effectively. In causal
discovery, DAG-GNN combines GNNs with VAEs
for DAG learning from data, optimizing acyclic-
ity constraints (Yu et al., 2019; Park and Kim,
2023). For topic modeling, GNNs propagate in-
formation via graphs, as in GNTM (Shen et al.,
2021) and GraphBTM (Zhu et al., 2018) using co-
occurrence statistics. However, integrating GNN-driven causal learning with topics remains underexplored (Gao et al., 2024; Behnam and Wang, 2024); existing methods treat GNNs as auxiliaries rather than embedding them in causal pipelines, and no work jointly learns topics and causal structures end-to-end without supervision.
2.4 Large Language Models for
Zero-Shot Keyword and Relation
Extraction
LLMs demonstrate strong zero-shot capabilities in
extracting keywords and semantic relations from text
(Rana et al., 2024; Chen et al., 2023). Using natural-
language prompts, LLMs can identify salient terms
and hypothesize causal connections without task-
specific training, making them valuable for uncov-
ering candidate causal pairs in unlabeled corpora.
Recent works leverage LLMs for weakly supervised
causal discovery by querying plausible relations or
generating pseudo-labels. However, standalone LLM
extraction lacks consistency across documents and
fails to yield a structured topic model.
These limitations underscore the need for
CGNTM, which integrates LLM-driven knowledge
extraction with neural causal inference. Unlike
conventional models such as BERTopic, which lack
causal interpretation, or CRNTM, which requires
supervision, CGNTM enables fully unsupervised
causal topic discovery. It combines zero-shot prompt-
based extraction with a nonlinear GNN-based causal
inference module, facilitating interpretable topic
modeling and causal structure learning without
labeled data. This positions CGNTM as a novel
contribution at the intersection of NLP, causality, and
deep learning.
3 METHODS
We propose a multi-stage framework (Figure 1) that
combines LLM-based semantic parsing with neural
causal graph modeling to uncover latent topics and
their causal relations. The pipeline proceeds as fol-
lows: (1) extract candidate keywords and causal pairs
from text using an LLM; (2) construct a global di-
rected causal graph over these candidates and refine
it via NOTEARS continuous optimization to enforce
acyclicity; (3) instantiate a SCM on this graph with
neural functional mechanisms; (4) perform directed
message passing with a GNN that masks reverse (anti-
causal) edges; (5) extend the GNN to a dual-layer
architecture for hierarchical topic representation; (6)
generate counterfactual textual data by intervening on
the learned causal topic variables and using a condi-
tional WGAN with a BERT-based consistency mod-
ule; and (7) train all components under a joint objec-
tive. The subsequent sections elaborate on each stage
in detail.
Figure 1: The overall multi-stage framework of CGNTM.
As illustrated in Figure 1, our framework seam-
lessly integrates symbolic knowledge extraction with
neural causal modeling. The following subsections
detail each component’s technical implementation
and theoretical foundations.
3.1 Keyword and Causal Pair
Extraction via LLMs
The first stage extracts candidate topic variables (key-
words) and causal links from an unlabeled corpus us-
ing a large language model (LLM). For each docu-
ment, the LLM is prompted to output: (a) salient key-
words summarizing its content, and (b) any explicit
cause-effect pairs (e.g., “X causes Y”) mentioned in
the text. Prior studies show LLMs can identify high-
quality keyphrases and infer causal relations in a zero-
shot manner. Building on this, we aggregate each doc-
ument’s extracted keywords and directed pairs into a
global set of topic terms and candidate edges. Each
edge may carry a confidence score based on frequency
or LLM-reported likelihood. This unsupervised pars-
ing injects semantic knowledge into the pipeline, pro-
viding a plausible initial causal graph structure.
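To make this extraction step concrete, the following Python sketch shows one way the per-document prompting and aggregation described above could be implemented; the prompt wording, the query_llm helper, and the JSON output format are illustrative assumptions rather than the exact prompts used in this work.

import json
from collections import Counter

def query_llm(prompt: str) -> str:
    # Hypothetical helper: wrap whatever LLM client is available and return its raw reply.
    raise NotImplementedError("plug in an LLM API client here")

PROMPT = (
    "Read the document below.\n"
    "1. List up to 10 salient keywords.\n"
    "2. List any explicit cause-effect pairs mentioned in the text as [cause, effect].\n"
    'Answer as JSON: {{"keywords": [...], "causal_pairs": [["cause", "effect"], ...]}}\n\n'
    "Document:\n{doc}"
)

def extract_candidates(corpus):
    """Aggregate per-document LLM outputs into a global term set and frequency-weighted edges."""
    terms, edges = Counter(), Counter()
    for doc in corpus:
        parsed = json.loads(query_llm(PROMPT.format(doc=doc)))
        terms.update(w.lower() for w in parsed["keywords"])
        edges.update((c.lower(), e.lower()) for c, e in parsed["causal_pairs"])
    return terms, edges  # edge counts act as simple confidence scores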
3.2 Global Causal Graph Construction
and NOTEARS Optimization
We next consolidate the LLM-derived keywords and
causal pairs into a unified global causal topic graph.
This graph is represented by an adjacency matrix,
where each entry indicates a directed edge between
topics. The initial structure is constructed from
the candidate edges identified in Section 3.1, with
weights set either uniformly or based on LLM-
provided confidence scores. Since these raw connec-
tions may contain cycles or noisy links, we refine the
graph by solving a structure learning problem with an
acyclicity constraint.
To this end, we use the NOTEARS method, which
casts DAG learning as a smooth optimization prob-
lem. It learns a weighted adjacency matrix that min-
imizes a given loss while ensuring the graph remains
acyclic through a differentiable constraint based on
the matrix exponential formulation. Specifically, we
impose:
h(A) = \mathrm{Tr}\,[\exp(A \circ A)] - m = 0    (1)

where $A \circ A$ denotes the elementwise square and $m = |V|$ is the number of nodes. Intuitively, $\mathrm{Tr}\,[\exp(A \circ A)] = m$ if and only if $A$ has no cyclic dependencies. This hard equality constraint can be handled via a penalty or augmented Lagrangian method during optimization. We thus solve:

\min_{A,\,\Theta} \; \mathcal{L}_{\mathrm{score}}(A, \Theta) \quad \text{s.t.} \quad h(A) = 0    (2)

where $\Theta$ denotes other model parameters (discussed later) and $\mathcal{L}_{\mathrm{score}}$ is a structure score or loss. We initialize $A$ with the LLM-extracted graph and iteratively update it to reduce $\mathcal{L}_{\mathrm{score}}$ while applying the DAG constraint (using gradient-based optimization). This yields a global causal topic graph that best ex-
plains the data without cycles. By casting structure
discovery as continuous optimization, we avoid brute-
force search over graphs and can efficiently handle
the moderate number of topic nodes. The resulting
adjacency A (with learned weights) will serve as the
backbone for our neural causal model.
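As a minimal illustration of Equations (1)-(2), the PyTorch snippet below computes the differentiable acyclicity measure h(A) and a penalized objective of the kind optimized here; the function names and the simple quadratic-penalty form are illustrative assumptions (the augmented Lagrangian schedule is described in Section 3.7).

import torch

def notears_h(A: torch.Tensor) -> torch.Tensor:
    # h(A) = tr(exp(A o A)) - m, which equals zero iff the weighted graph is acyclic.
    m = A.shape[0]
    return torch.trace(torch.linalg.matrix_exp(A * A)) - m

def penalized_objective(score_loss: torch.Tensor, A: torch.Tensor, rho: float = 1.0, alpha: float = 0.0):
    # Quadratic-penalty / augmented-Lagrangian style surrogate for "min L_score s.t. h(A) = 0".
    h = notears_h(A)
    return score_loss + 0.5 * rho * h * h + alpha * h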
3.3 Neural Structural Causal Model
(NSCM)
Given the DAG learned in Section 3.2, we define a structural causal model to describe how each topic variable is generated as a function of its causes. Following the Pearlian framework, an SCM consists of a set of structural equations $X_i = f_i(PA_i, N_i)$ for each node $X_i$ (representing topic $v_i$), where $PA_i$ denotes the set of parent variables (direct causes of $X_i$ in $G$) and $N_i$ is an exogenous noise term (independent for each $i$) capturing unmodeled variation.
In our Neural SCM, we parameterize each structural function as a neural network, allowing for complex non-linear causal relationships between topics. In particular, for each directed edge $v_j \to v_i$ in the graph, the causal influence of topic $v_j$ on topic $v_i$ is modeled by learnable weights within the neural function $f_i$. Formally, let $z_i$ be a latent representation or "activation" of topic $v_i$. We define:

z_i = f_i\big(\{ z_j : v_j \in PA_i \}\big) + n_i    (3)

where $n_i \sim \mathcal{N}(0, \sigma^2)$ (or another suitable noise distribution) and $f_i$ is implemented as a multilayer perceptron (MLP). This yields a deep non-linear structural equation model (SEM) that generalizes traditional linear causal models.
The collection $\{f_i\}_{i=1}^{m}$ together with the adjacency $A$ defines a joint distribution $P(X_1, \ldots, X_m)$ over the topic variables, essentially a causal topic model. Importantly, because $A$ is acyclic, we can sample from this model by ancestral sampling (order the topics topologically and sample each $X_i$ from $f_i(PA_i)$). The Neural SCM grants our model the capacity to capture complex interactions (e.g., a cause may affect an effect in a highly non-linear or context-dependent way) (Zečević et al., 2021). It also enables interventions: we can apply the do-operator $do(X_i = x^{*})$ by clamping $z_i$ to a chosen value and forward-simulating the rest of the model, to infer causal effects or counterfactual topic configurations.

In summary, at this stage we have a parameterized causal model $\mathcal{M} = (A, \{f_i\})$ that explains how topics cause one another. The parameters of $f_i$ (and any distributional parameters of $N_i$) will be learned from data, typically by maximizing the likelihood of observed topic occurrences in documents.
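The sketch below illustrates the kind of per-node MLP parameterization and ancestral sampling (including the do-operator) described in this section; the layer sizes, Gaussian noise, and class interface are illustrative assumptions, not the exact architecture used in our experiments.

import torch
import torch.nn as nn

class NeuralSCM(nn.Module):
    """One MLP f_i per topic node; A[j, i] != 0 encodes a directed edge j -> i."""
    def __init__(self, A: torch.Tensor, hidden: int = 32):
        super().__init__()
        self.A, self.m = A, A.shape[0]
        self.parents = [torch.nonzero(A[:, i]).flatten() for i in range(self.m)]
        self.f = nn.ModuleList([
            nn.Sequential(nn.Linear(max(len(p), 1), hidden), nn.ReLU(), nn.Linear(hidden, 1))
            for p in self.parents
        ])

    @torch.no_grad()
    def sample(self, topo_order, noise_std=0.1, interventions=None):
        """Ancestral sampling along a topological order; `interventions` maps node index -> clamped value."""
        z = torch.zeros(self.m)
        for i in topo_order:
            if interventions and i in interventions:
                z[i] = interventions[i]          # do(X_i = x*): clamp and skip f_i
                continue
            pa = self.parents[i]
            inp = z[pa].unsqueeze(0) if len(pa) > 0 else torch.zeros(1, 1)
            z[i] = self.f[i](inp).squeeze() + noise_std * torch.randn(())
        return z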
3.4 Directionally Masked GNN
Propagation
To leverage the learned causal graph during both
training and inference, we implement a directionally
masked Graph Attention Network (GAT) that propa-
gates information strictly along directed edges (Wang
et al., 2025; Liu et al., 2022). Unlike conventional
GATs that allow bidirectional attention, our model en-
forces causal directionality to align with the causal se-
mantics of the structural causal model (SCM).
We implement a directionally masked GNN on the
directed acyclic graph (DAG) learned in Section 3.2
(Kaushik et al., 2019). The DAG is represented by an adjacency matrix $A$, where $A_{ji}$ denotes the weight of the directed edge from node $j$ to node $i$, with $A_{ji} = 0$ for nonexistent or directionally excluded edges (i.e., edges not present in the causal graph). Each topic node $v_i$ is associated with a hidden state vector $h_i^{(l)}$ at layer $l$ of the GNN (with $h_i^{(0)}$ initialized from node features, e.g., the textual embedding of the keyword or an initial topic score in a document). For each topic node $v_i$, we compute attention weights only over its parent nodes, defined as $PA_i = \{ j \mid A_{ji} > 0 \}$:

\alpha_{ji} = \mathrm{softmax}\big(\mathrm{LeakyReLU}\big(a^{\top} [\, W^{(l)} h_j^{(l)} \,\|\, W^{(l)} h_i^{(l)} \,]\big)\big)    (4)
Then the layer-wise update for node $i$ is defined as:

h_i^{(l+1)} = \sigma\Big( \sum_{j \in PA_i} \alpha_{ji} \, W^{(l)} h_j^{(l)} \Big)    (5)
where $\|$ denotes concatenation, $a$ is a learnable attention vector, $W^{(l)}$ are layer-specific weight matrices, and $\sigma(\cdot)$ is a non-linear activation function. The attention coefficients $\alpha_{ji}$ serve as dynamic, learned weights on edge $j \to i$, capturing causal influence adaptively based on node features.

In matrix form, the propagation can be expressed as:

H^{(l+1)} = \sigma\big( A \cdot (H^{(l)} W^{(l)}) \big)    (6)

where $H^{(l)}$ is the matrix of all node hidden states at layer $l$, and $A$ is the masked adjacency matrix from Section 3.2, ensuring that information propagates only along the causal directions defined by the DAG. By stacking $L$ such layers (where $L$ is the number of GNN layers, typically set to 2-4 based on the graph depth and empirical tuning), each node's representation $h_i^{(L)}$ captures information from its causal ancestors up to $L$ hops away. This directed GNN aligns
naturally with the semantics of the SCM: messages
flow from causes to effects, mirroring how interven-
tions or changes propagate in the causal graph. We
use this GNN both to encode observational data into
latent topic representations and to simulate the spread
of causal influence in the topic graph.
Unlike standard graph convolutions that mix incoming and outgoing messages, our GNN restricts propagation to the incoming edges defined by $PA_i$, implementing causal message passing. While inspired by generative graph models such as DAG-GNN, our architecture enforces strict temporal directionality, ensuring information flows from cause to effect. The directional GNN operates jointly with the neural SCM (Section 3.3) to enable efficient inference: the NSCM defines the generative process for the $z_i$, specifying how topic variables are generated from their causal parents, while the GNN implements the inference process, computing topic representations from observed data while respecting causal constraints. During training, the GNN parameters $\{W^{(l)}, a\}$ are learned jointly with the SCM functions $\{f_i\}$ as part of the unified optimization framework (Section 3.7), ensuring the inference network is consistent with the generative causal model. Given partial topic activations, the GNN infers missing variables via causal propagation. Under interventions, it rapidly computes perturbed topic representations, supporting the counterfactual reasoning that is critical for applications.
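A compact PyTorch sketch of the directionally masked attention layer in Equations (4)-(5) is given below, using a dense adjacency matrix; the masking value, the handling of root nodes, and the ReLU activation are implementation assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedCausalGATLayer(nn.Module):
    """GAT-style layer in which node i attends only to its causal parents {j : A[j, i] > 0}."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)
        self.a = nn.Linear(2 * out_dim, 1, bias=False)    # attention vector a^T [Wh_j || Wh_i]

    def forward(self, H: torch.Tensor, A: torch.Tensor) -> torch.Tensor:
        Wh = self.W(H)                                     # (m, out_dim)
        m = Wh.shape[0]
        # e[i, j]: attention logit for parent j of node i, i.e. edge j -> i.
        pair = torch.cat([Wh.unsqueeze(0).expand(m, m, -1),   # [i, j] -> Wh_j
                          Wh.unsqueeze(1).expand(m, m, -1)],  # [i, j] -> Wh_i
                         dim=-1)
        e = F.leaky_relu(self.a(pair)).squeeze(-1)         # (m, m)
        e = e.masked_fill(A.t() <= 0, -1e9)                # keep only edges with A[j, i] > 0
        alpha = torch.softmax(e, dim=1)                    # normalize over each node's parents
        has_parent = (A.t() > 0).any(dim=1, keepdim=True)
        alpha = torch.where(has_parent, alpha, torch.zeros_like(alpha))  # roots receive no messages
        return torch.relu(alpha @ Wh)                      # Eq. (5) with sigma = ReLU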
3.5 Hierarchical Topic Modeling with
Dual-Layer GNN
Real-world topics often exhibit hierarchical structure,
with fine-grained concepts nested within broader the-
matic categories. To capture this hierarchy, we extend
our single-layer causal topic model to a dual-layer ar-
chitecture that explicitly represents both micro-level
topics and macro-level themes. Our hierarchical ap-
proach consists of three sequential steps: bottom-
up aggregation, horizontal propagation, and top-down
refinement.
We introduce an upper layer of abstract topics
that group together semantically related base top-
ics. Concretely, suppose we partition $V$ into $K$ clusters $\{C_1, \ldots, C_K\}$ (each $C_k$ is a subset of the base topics that constitute a higher-level theme). These clusters could be obtained by heuristic clustering of the keyword embeddings or even by another pass of LLM-based grouping. We then create $K$ new nodes $\{\tilde{v}_1, \ldots, \tilde{v}_K\}$ representing the abstract topics (one per cluster). We connect each base topic node $v_i$ to its abstract parent $\tilde{v}_k$ (if $v_i \in C_k$) via an undirected link, and we also allow directed causal edges among the abstract nodes themselves (induced by the base-level DAG: e.g., if some $v_i \in C_a$ causes $v_j \in C_b$, we add a directed edge $\tilde{v}_a \to \tilde{v}_b$ between abstract topics).
This yields a two-layer graph: Layer 1 comprises the base topics $V$ with directed edges $E$ (from our learned $A$). Layer 2 comprises the abstract topics $\tilde{V}$ with directed edges among them. Additionally, bipartite connections link each $\tilde{v}_k$ to all $v_i \in C_k$.
We then design a two-tier message passing
scheme:
3.5.1 Bottom-up Aggregation
In the bottom-up aggregation step, each abstract topic
node k computes an initial state as the mean of its
member topic embeddings:
\tilde{h}_k^{(0)} = \frac{1}{|C_k|} \sum_{v_i \in C_k} h_i^{(L)}    (7)

using the final embeddings $h_i^{(L)}$ from the base GNN layer as input. This step produces a representation for the higher-level topic as a composition of its subtopics.
3.5.2 Horizontal Propagation at the Abstract
Level
Next, in the horizontal propagation step, we apply another GNN on the abstract topic graph (layer 2) for $L^{\star}$ steps, using the directed edges among $\tilde{V}$. This is analogous to Section 3.4 but on the smaller graph of $K$ nodes. For each abstract topic node $\tilde{v}_k$, we compute attention weights only over its parent nodes:

\alpha_{mk} = \mathrm{softmax}\big(\mathrm{LeakyReLU}\big(a^{\top} [\, W^{(l)} \tilde{h}_m^{(l)} \,\|\, W^{(l)} \tilde{h}_k^{(l)} \,]\big)\big)    (8)

The layer-wise update for node $\tilde{v}_k$ is:

\tilde{h}_k^{(l+1)} = \sigma\Big( \sum_{\tilde{v}_m \in PA(\tilde{v}_k)} \alpha_{mk} \, W^{(l)} \tilde{h}_m^{(l)} \Big), \quad l = 0, \ldots, L^{\star} - 1    (9)

where $PA(\tilde{v}_k)$ denotes the abstract parent nodes of $\tilde{v}_k$ in the abstract DAG and $\alpha_{mk}$ is the attention coefficient between parent $\tilde{v}_m$ and target $\tilde{v}_k$. This yields refined high-level topic embeddings $\tilde{h}_k^{(L^{\star})}$ that capture how broad themes influence each other.
3.5.3 Top-down Refinement
Finally, in the top-down refinement step, the abstract
nodes pass messages back to their children to update
base topic embeddings with global context (e.g., each $v_i$ may receive an additive message from its abstract parent $\tilde{v}_k$). In our implementation, we incorporate this by concatenating the parent's representation to the base node before a final linear transformation.
The resulting model forms a hierarchical topic
structure: base-level nodes represent fine-grained
concepts, while abstract-level nodes capture broader
thematic categories. Directional GNN propagation
ensures semantic consistency across levels. This de-
sign is inspired by hierarchical topic models such as
hLDA, which organize topics using Bayesian priors.
In contrast, our approach employs deterministic clus-
tering and neural message passing to simultaneously
model intra-level causal dependencies and inter-level
abstractions. The dual-layer GNN enhances inter-
pretability through structured topic organization and
improves predictive performance by enabling statisti-
cal sharing among related topics.
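The two helper functions below sketch the cluster-level initialization of Equation (7) and the way abstract edges are induced from the base-level DAG, as described above; the tensor layout and the simple binary edge induction are illustrative assumptions.

import torch

def aggregate_abstract_topics(H_base: torch.Tensor, clusters):
    # Eq. (7): each abstract topic starts as the mean of its member base-topic embeddings.
    return torch.stack([H_base[list(c)].mean(dim=0) for c in clusters])

def induce_abstract_edges(A_base: torch.Tensor, clusters):
    # Add edge C_a -> C_b whenever some base topic in C_a causes a base topic in C_b.
    K = len(clusters)
    A_abs = torch.zeros(K, K)
    for a, Ca in enumerate(clusters):
        for b, Cb in enumerate(clusters):
            if a != b and A_base[list(Ca)][:, list(Cb)].abs().sum() > 0:
                A_abs[a, b] = 1.0
    return A_abs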
3.6 Counterfactual Generation with
WGAN and BERT Consistency
A central motivation of our causal topic model is
to enable counterfactual reasoning—i.e., answering
“what if” questions by generating text under hypo-
thetical interventions. To this end, we leverage the
learned causal graph and SCM to guide a text gen-
eration module that produces counterfactual docu-
ments, while enforcing semantic consistency via a
pretrained language model (BERT). Our approach
adopts a conditional generative adversarial network,
where the generator receives an intervened topic rep-
resentation and outputs synthetic text, and the dis-
criminator distinguishes between real and generated
samples. We employ the Wasserstein GAN variant to
enhance training stability and mode coverage (Gulra-
jani et al., 2017).
3.6.1 Conditioning on Causal Topics
The generator is conditioned on the causal topic vector of a document, which we obtain from the Neural SCM/GNN. For each real document $d$, we first infer its topic activation vector $z = [z_1, \ldots, z_m]$ using the current model; this can be done by feeding $d$ through the GNN encoder or by direct inference in the SCM. We then sample an intervention on $z$: for instance, to generate a counterfactual where topic $v_k$ is altered, we set $z_k$ to a new value $z_k^{*}$ (e.g., zero to simulate removing that topic, or a higher value to simulate emphasizing it) while keeping the other $z_{i \neq k}$ the same, or also updating descendants of $v_k$ via the SCM to reflect causal effects. Denote this intervened topic vector as $z_{cf} = do(v_k = z_k^{*})$.
According to our causal model, $z_{cf}$ represents a coherent counterfactual state of the topics (the distribution that would occur if $v_k$ were set to $z_k^{*}$). The generator $G$ then maps $(z_{cf}, \xi) \mapsto \tilde{x}$, where $\xi$ is random noise and $\tilde{x}$ is a generated text. In practice, $G$ can be implemented as a transformer-based language model or any sequence decoder that accepts a conditioning vector (here $z_{cf}$ may be fed through a projection and used as the initial hidden state or as a prompt). The discriminator $D$ is trained to distinguish a real document $x$ from a generated $\tilde{x}$, while $G$ is trained to fool $D$.
We optimize the standard WGAN objective with gradient penalty,

\min_G \max_D \; \mathbb{E}_{x \sim P_{\text{data}}}[D(x)] - \mathbb{E}_{z, \xi}[D(G(z, \xi))] - \lambda \, \mathbb{E}_{\hat{x}}\big[ (\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1)^2 \big]    (10)

which guides $G$ to produce outputs whose distribution matches the real text distribution (under various interventions on $z$). Here $\hat{x}$ denotes points sampled along the line between real and generated samples for the gradient penalty. Crucially, because $z$ is drawn from our causal model (including interventional cases that may not appear in the training data), the generator learns to produce texts for both observational and counterfactual topic combinations.
This approach is analogous to the CausalGAN
framework (Kocaoglu et al., 2017), where a generator
architecture consistent with a causal graph can output
samples from both the true observational and inter-
ventional distributions. In our case, the SCM pro-
vides z for interventions, and the conditional GAN
learns to map those to realistic text, effectively learn-
ing P (text|topics) for both normal and intervened top-
ics. Prior work has demonstrated the feasibility of us-
ing WGAN to generate data consistent with a given
causal graph, even for interventions not seen in the
training set.
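For reference, the snippet below sketches the WGAN-GP losses in Equation (10) for a critic and generator operating on continuous document representations; it assumes 2-D (batch, feature) inputs and λ = 10, and leaves the text decoder itself abstract.

import torch

def gradient_penalty(D, real, fake, lam=10.0):
    # Penalize deviation of the critic's gradient norm from 1 along real/fake interpolates.
    eps = torch.rand(real.size(0), 1, device=real.device)
    x_hat = (eps * real + (1.0 - eps) * fake).requires_grad_(True)
    grads = torch.autograd.grad(D(x_hat).sum(), x_hat, create_graph=True)[0]
    return lam * ((grads.norm(2, dim=1) - 1.0) ** 2).mean()

def critic_loss(D, real, fake):
    # The critic maximizes E[D(x)] - E[D(G(z, xi))] - GP, i.e. minimizes the negation below.
    return D(fake).mean() - D(real).mean() + gradient_penalty(D, real, fake)

def generator_loss(D, fake):
    return -D(fake).mean()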
3.6.2 BERT-Based Consistency Regularization
While the GAN loss ensures the generated text is realistic, we also want the counterfactual text $\tilde{x}$ to remain maximally similar to the original text $x$ in all aspects except those affected by the intervention. We introduce a consistency module using BERT to enforce this. Let $\mathrm{Enc}_{\mathrm{BERT}}(x)$ be the contextual embedding of the original text, and likewise $\mathrm{Enc}_{\mathrm{BERT}}(\tilde{x})$ for the generated text. We add a penalty term:

\mathcal{L}_{\mathrm{cons}} = \big\lVert \mathrm{Enc}_{\mathrm{BERT}}(x) - \mathrm{Enc}_{\mathrm{BERT}}(\tilde{x}) \big\rVert    (11)
which encourages the generated text to lie close
to the original in semantic embedding space. In prac-
tice, we compute this as the cosine distance between the BERT embeddings of $x$ and $\tilde{x}$, or as a weighted token-level similarity loss. The idea is to preserve the document's main content, modifying only the topical details related to the intervened variables. The BERT consistency loss guides $G$ to make minimal but precise edits. We also explicitly verify that $G$ has indeed effected the desired change (e.g., by checking that keywords for topic $v_k$ are reduced or removed in $\tilde{x}$).
In summary, our counterfactual generation mod-
ule produces alternative versions of documents by
toggling causal factors, using a WGAN to maintain
fluency and realism and a BERT-based regularizer to
maintain fidelity to the source content.
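A possible form of the consistency term, computed here as a cosine distance between [CLS] embeddings with the Hugging Face transformers library, is sketched below; it is written as an evaluation-time check, since back-propagating into the generator additionally requires a differentiable (e.g., soft-embedding) representation of the generated text.

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()

@torch.no_grad()
def bert_consistency(original: str, counterfactual: str) -> torch.Tensor:
    # Cosine distance between [CLS] embeddings of the original and counterfactual documents.
    def embed(text):
        inputs = tokenizer(text, return_tensors="pt", truncation=True)
        return bert(**inputs).last_hidden_state[:, 0]    # [CLS] token embedding
    return 1.0 - torch.cosine_similarity(embed(original), embed(counterfactual), dim=-1).mean()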
3.7 Joint Objective and Training
Strategy
All components of our model are trained jointly to
ensure that causal structure learning, topic modeling,
and text generation inform each other. We formu-
late a multi-term loss that combines the objectives
of the structural modules and the generative modules.
Specifically, our overall loss $\mathcal{L}_{\mathrm{total}}$ includes:
• A structure loss $\mathcal{L}_{\mathrm{struct}}$ for fitting the causal graph to data (for example, the negative log-likelihood of the observed topic occurrences under the Neural SCM, or an evidence lower bound as in a VAE), plus any L1/L2 regularization on $A$ to encourage sparse, interpretable graphs.
• The DAG constraint penalty $\lambda_{\mathrm{DAG}} |h(A)|$ to enforce acyclicity (or an augmented Lagrangian term as in NOTEARS).
• The GAN loss terms for text generation (the generator and discriminator Wasserstein losses, denoted $\mathcal{L}_G$ and $\mathcal{L}_D$).
• The BERT consistency loss $\mathcal{L}_{\mathrm{cons}}$.
We weight these components with hyperparameters to balance their influence:

\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{struct}} + \lambda_{\mathrm{DAG}}\, h(A) + \lambda_G \mathcal{L}_G + \lambda_D \mathcal{L}_D + \lambda_{\mathrm{cons}} \mathcal{L}_{\mathrm{cons}}    (12)
Our training strategy is a hybrid of alternating op-
timization and stage-wise training. Inspired by two-
stage approaches in causal generative modeling, we
first train the structure and SCM on observational data
alone, then train the GAN generator and discriminator
on text generation given the learned topics, and finally
fine-tune all components together.
In the first stage, we optimize $\mathcal{L}_{\mathrm{struct}}$ with the DAG constraint (using the augmented Lagrangian or a penalty method) to learn $A$ and the neural parameters of the $f_i$ (as well as to learn good initial GNN weights for encoding topics). This may be done via variational inference, e.g., treating the topic activations $z_i$ in each document as latent variables and maximizing an Evidence Lower Bound (ELBO), or simply by treating inferred topic vectors (from an unsupervised topic model or the LLM extraction frequencies) as training data for a regression model defined by the SCM.
Once a reasonable causal graph and SCM are
learned, we proceed to train the GAN. We gener-
ate training pairs $(z_{cf}, x)$ by taking real documents $x$, inferring their topic vector $z$, randomly sampling interventions on $z$ (to get $z_{cf}$), and using $x$ as the "real" example associated with the original $z$, versus $G(z_{cf})$ as a "fake" example for the intervened case. The discriminator $D$ learns to judge realism, while $G$ learns to produce plausible text for both actual and hypothetical topic conditions. We incorporate the consistency loss with the original text $x$ during $G$'s updates to ensure counterfactuals remain anchored to $x$.
In the final joint stage, we allow gradients from
the text generation loss to also update the earlier
modules (SCM and GNN), which can further re-
fine the topic representations to better support flu-
ent generation. In practice, we alternate between an
epoch of structure+SCM/GNN updates (minimizing $\mathcal{L}_{\mathrm{struct}} + \lambda_{\mathrm{DAG}}\, h(A)$ while keeping the GAN fixed), and an epoch of GAN updates (minimizing $\mathcal{L}_G + \lambda_{\mathrm{cons}} \mathcal{L}_{\mathrm{cons}}$ and maximizing $\mathcal{L}_D$ with $A$ and the SCM fixed). This alternating schedule is similar in spirit to expectation-maximization or the two-phase training of CausalGAN: first learn to model the latent causal factors, then learn to generate observable data from those factors.
Finally, we fine-tune end-to-end on the combined objective $\mathcal{L}_{\mathrm{total}}$ (with a small learning rate) to ensure all parts are mutually consistent. Throughout training, we monitor the DAG constraint and gradually increase the penalty coefficient $\lambda_{\mathrm{DAG}}$ to drive $h(A) \to 0$, ensuring the learned graph remains acyclic.
4 EXPERIMENTS AND RESULTS
4.1 Experimental Setup
4.1.1 Dataset and Preprocessing
We train and evaluate CGNTM on the PubMed Lung
Cancer Corpus, consisting of approximately 20,000
English-language articles (titles and abstracts) pub-
lished over the past two decades. The dataset was
built by querying PubMed with domain-specific key-
words such as “lung cancer” and “non-small cell car-
cinoma”.
For structured causal inputs, we use an LLM
with prompt engineering to extract salient keywords
from each document. Causal relationships are then inferred from contextual patterns, producing triples of the form (cause, relation, effect) that are used to construct document-level causal graphs for the CGNTM pipeline. The corpus is split
into 80% training and 20% testing, with 10% of train-
ing held out for hyperparameter tuning.
4.1.2 Evaluation Metrics
We evaluate CGNTM on topic quality and causal cor-
rectness using five metrics.
Topic Coherence: Normalized Pointwise Mu-
tual Information (NPMI) measures the average nor-
malized pointwise mutual information among the top
words within each topic. For topic $t$, let $W_t$ denote its top-$k$ words. The NPMI score is computed as:

\mathrm{NPMI}(t) = \frac{2}{k(k-1)} \sum_{1 \le i < j \le k} \frac{\log \frac{P(w_i, w_j)}{P(w_i) P(w_j)}}{-\log P(w_i, w_j)}    (13)

where $P(w_i)$ and $P(w_i, w_j)$ are estimated from corpus-wide co-occurrence counts. Higher NPMI indicates stronger semantic coherence and aligns better with human judgment.
Topic Diversity (TD): Measures the proportion of
unique words across all topics. Let $T$ denote the set of all topics and $V_{\mathrm{top}}$ the set of all top-$k$ words across topics. Then:

\mathrm{TD} = \frac{\big| \bigcup_{t \in T} W_t \big|}{k \cdot |T|}    (14)

Higher TD implies broader topic coverage and lower redundancy.
Higher TD implies broader topic coverage and
lower redundancy.
Causal Precision (CP): The proportion of in-
ferred causal edges $(i \to j) \in E$ that match a curated biomedical causal knowledge base $E^{*}$:

\mathrm{CP} = \frac{|E \cap E^{*}|}{|E|}    (15)
The CP metric is a standard precision measure in
causal discovery literature, directly adapted from the
predictive model evaluation metric “Precision” to as-
sess the accuracy of inferred causal edges. Higher
CP indicates better alignment with known causal re-
lations.
Reverse Causality Rate (RCR): Measures the
fraction of inferred causal edges that contradict
known causal directionality:
\mathrm{RCR} = \frac{\big| \{ (i \to j) \in E \mid (j \to i) \in E^{*} \} \big|}{|E|}    (16)
A lower RCR suggests more accurate directional
inference.
Counterfactual Semantic Alignment (CSA):
Assesses whether interventions modify only the tar-
geted causal components while preserving unrelated
semantics. Let $x$ be a document, $\tilde{x}$ its counterfactual version, and $\varphi(\cdot)$ the [CLS] embedding from BERT. Then:

\mathrm{CSA}(x, \tilde{x}) = \cos\big( \varphi(x), \varphi(\tilde{x}) \big)    (17)
where cosine similarity is used to measure seman-
tic alignment. Higher CSA reflects more precise and
faithful counterfactual generation.
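The topic-diversity and causal-edge metrics above reduce to simple set operations; the short reference implementations below follow the definitions directly, with reference_edges standing in for the curated knowledge base E*.

def topic_diversity(topics):
    # TD = |unique top-k words across topics| / (k * |T|); `topics` is a list of top-k word lists.
    k = len(topics[0])
    unique_words = {w for topic in topics for w in topic}
    return len(unique_words) / (k * len(topics))

def causal_precision(inferred_edges, reference_edges):
    # CP = fraction of inferred directed edges (i, j) that appear in the reference set.
    inferred = set(inferred_edges)
    return len(inferred & set(reference_edges)) / max(len(inferred), 1)

def reverse_causality_rate(inferred_edges, reference_edges):
    # RCR = fraction of inferred edges whose reverse direction appears in the reference set.
    inferred, ref = set(inferred_edges), set(reference_edges)
    return sum((j, i) in ref for (i, j) in inferred) / max(len(inferred), 1)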
4.1.3 Implementation Details
CGNTM is implemented in PyTorch, using a 2-layer
GAT with directional masking for causal graph prop-
agation and a 3-layer MLP for nonlinear SCM de-
pendencies. Hierarchical modeling aggregates base-
level topic embeddings into abstract topics. Coun-
terfactual generation employs a WGAN with gra-
dient penalty (λ = 10) and BERT-based regulariza-
tion. Optimization uses Adam (learning rate 0.001,
batch size 16) for up to 100 epochs with early stop-
ping. Experiments run on a GPU, and hyperparameters are tuned on a validation set. Source code is available
at https://github.com/Longcchao-Wang/Causal-Topic
for reproducibility.
4.2 Quantitative and Qualitative
Results
4.2.1 Comparison with Baselines
We evaluate CGNTM against six representative base-
lines spanning classical, neural, graph-based, and
causal paradigms, focusing on models that support
probabilistic latent factor modeling for unified topic
discovery and causal inference. Recent semantic
clustering approaches like BERTopic and Top2Vec
achieve strong coherence via pre-trained embeddings
but lack latent variables essential for causal structure
learning, hence their exclusion. The baselines are:
• CRNTM (Tang et al., 2024): A supervised causal model learning relations among latent topics and labels via structural equation modeling over a DAG; supervision is simulated with synthetic biomedical risk factor labels for fair comparison.
• LDA (Blei et al., 2003): A classical probabilistic model with Dirichlet priors on document-topic and topic-word distributions.
• ETM (Dieng et al., 2020): A neural topic model projecting words into continuous latent spaces to improve topic coherence.
• NVDM (Miao et al., 2016): A variational autoencoder encoding documents as latent vectors from bag-of-words input.
• GNTM (Shen et al., 2021): A graph-based neural model using document-level word co-occurrence graphs and GNNs.
• GNTM-CK (Zhu et al., 2023): GNTM extended with ConceptNet commonsense knowledge.
These facilitate comparisons across supervised
vs. unsupervised, causal vs. correlational, and
knowledge-enhanced vs. data-driven paradigms. Ta-
ble 1 summarizes performance across NPMI, TD, CP,
RCR, and CSA, showing CGNTM outperforms all in
coherence, diversity, and causal alignment.
For topic quality, CGNTM achieves the highest
NPMI (0.30), surpassing supervised CRNTM (0.29)
and unsupervised baselines like GNTM-CK (0.26)
and ETM (0.24), indicating superior coherence from
biomedical priors and causal constraints. It also
leads in TD (0.82), reflecting broader coverage with
minimal redundancy via hierarchical modeling and
synonym-aware clustering.
In causal accuracy, only CRNTM and CGNTM produce explicit causal graphs; CGNTM's CP (0.70) matches supervised CRNTM (0.69), uncovering meaningful biomedical causalities without supervision, while its low RCR (0.10) confirms reliable directionality, close to CRNTM's 0.07. RCR is not applicable to the non-causal baselines, which do not output directed edges.
For CSA, CGNTM’s 0.88 surpasses all base-
lines, ensuring counterfactuals remain semantically
consistent except for targeted interventions, unlike
CRNTM’s 0.80 due to lacking explicit counterfactual
training. Other models lack intervention support, ren-
dering CSA inapplicable.
Table 1: Comparison of CGNTM with baseline models.
Model NPMI TD CP RCR CSA
LDA 0.18 0.81 0.51 N/A N/A
NVDM 0.22 0.72 0.53 N/A N/A
ETM 0.24 0.73 0.56 N/A N/A
GNTM 0.25 0.76 0.58 N/A N/A
GNTM-CK 0.26 0.77 0.63 N/A N/A
CRNTM 0.29 0.80 0.69 0.07 0.80
CGNTM (ours) 0.30 0.82 0.70 0.10 0.88
4.2.2 Ablation Study
To assess the contribution of individual components
within CGNTM, we conduct an ablation study with
four modified variants (denoted as “w/o” for “with-
out”): w/o LLM Extraction, w/o Neural SCM, w/o
WGAN + Consistency, and w/o Hierarchy, as sum-
marized in Table 2.
a) w/o LLM Extraction: Replaces LLM-based
keyword and causal triple extraction with co-
occurrence-based graphs (e.g., PMI edges). Perfor-
mance drops in CP and CSA validate the importance
of knowledge-guided structure.
b) w/o Neural SCM: Replaces the nonlinear Struc-
tural Causal Model with a linear or identity mapping,
disabling deep causal propagation. NPMI and CP de-
cline, highlighting the benefit of modeling nonlinear
causal effects.
c) w/o WGAN + Consistency: Removes the coun-
terfactual generation and semantic consistency loss.
While core topic metrics remain stable, CSA signifi-
cantly drops, confirming the WGAN’s role in ensur-
ing targeted and semantically aligned interventions.
d) w/o Hierarchy: Flattens the topic structure
by removing macro-micro topic separation. TD de-
creases due to more redundancy, and NPMI also
slightly declines, suggesting that hierarchical model-
ing improves topic specialization.
Table 2: Ablation results for CGNTM.
Model NPMI TD CP RCR CSA
Full CGNTM 0.30 0.82 0.70 0.10 0.88
(–) LLM Extraction 0.27 0.80 0.61 0.15 0.79
(–) Neural SCM 0.28 0.81 0.64 0.13 0.83
(–) WGAN + Consistency 0.29 0.81 0.67 0.11 0.76
(–) Hierarchical Structure 0.28 0.76 0.66 0.12 0.84
4.3 Hyperparameter Sensitivity
We evaluate the robustness of CGNTM with respect
to two key hyperparameters: the number of topics (K)
and the knowledge weight (λ), which controls the in-
fluence of the concept graph.
Number of Topics (K): We varied K from 20 to
100 and observed its impact on NPMI and CSA. Topic
coherence (NPMI) improves as K increases, peaking
around K = 50, then plateaus or slightly declines as
topics become too fine-grained (e.g., 0.30 at K = 50
vs. 0.29 at K = 100). Topic diversity grows with K,
but with diminishing returns after 50. In terms of
causal metrics, CSA peaks in the range of K = 50-
60, balancing coherence and coverage. Too few top-
ics (K = 20) yield broad, less specific topics (CSA around 45%), while too many (K = 100) introduce redundancy and fragment topic quality.
Knowledge Weight (λ): We tested λ in the range
[0, 1.0]. At λ = 0 (no concept supervision), CP and
CSA drop substantially, as expected. Increasing λ to
0.5 steadily improves causal metrics, with CSA rising
from 45% to 58%. However, too high a weight (λ = 1.0) slightly reduces coherence (NPMI around 0.285), as the model may overfit to concept connections. We
found λ = 0.5–0.7 provides the best trade-off, and
used λ = 0.6 as default.
Overall, CGNTM shows stable performance
across a wide hyperparameter range. We recommend
K ≈ 50 and λ in [0.5, 0.7] for corpora of similar size
and domain complexity. These results confirm that
CGNTM’s gains are not contingent on narrow hyper-
parameter settings, but stem from the model’s design.
4.4 Summary and Discussion
The results underscore CGNTM’s strengths as the
first unsupervised topic model integrating LLM-based
extraction with neural causal modeling to uncover in-
terpretable causal relations. Relying solely on un-
labeled data, CGNTM matches supervised CRNTM
in performance while discovering novel causali-
ties beyond label structures and organizing topics
hierarchically—a capability CRNTM lacks.
Compared to unsupervised models like NVDM,
CGNTM excels in coherence and diversity, construct-
ing directed graphs for causal reasoning. For exam-
ple, it infers directionality (e.g., EGFR mutation → drug resistance) where traditional models merely co-
cluster terms, enabling counterfactual simulation and
hypothesis generation.
Relative to NVDM, CGNTM enhances quality
via causal regularization, avoiding posterior collapse.
Unlike BERTopic’s embedding clustering without
causality, CGNTM’s generative framework (SCM, di-
rectional GNN, causal priors) captures both semantic
and causal structures.
In summary, CGNTM bridges contextual topic
modeling and causal discovery, advancing unsuper-
vised methods with descriptive and explanatory in-
sights aligned to domain knowledge.
5 CONCLUSION
We introduce CGNTM, the first unsupervised causal
topic model merging LLM knowledge extraction with
neural causal inference. It discovers interpretable
topics and domain-reflective causal graphs without
labels, achieving competitive coherence and diver-
sity while enabling counterfactual reasoning through
structured SCM and GNN design. This supports ex-
planatory modeling, inferring relations like “EGFR mutation → drug resistance” from text, which is valuable for biomedicine and the social sciences.
Limitations include dependence on LLM triple
quality (errors impact inference), computational in-
tensity (BERT embeddings, GNN propagation, adver-
sarial training), and causal evaluation challenges from
limited ground-truth.
Future work involves multilingual/cross-domain
applications, semi-supervised signals (e.g., seed
causal edges), and structured knowledge bases for
graph constraints.
Ultimately, CGNTM advances topic modeling by
embedding causal discovery unsupervised, fostering
automated hypothesis generation beyond “what” to
“why”.
ACKNOWLEDGEMENTS
This research was funded by the Innovation Fund for
Medical Sciences of Chinese Academy of Medical
Sciences grant number 2021-I2M-1-033.
REFERENCES
Behnam, A. and Wang, B. (2024). Graph neural network
causal explanation via neural causal models. In Euro-
pean Conference on Computer Vision, pages 410–427.
Springer Nature Switzerland.
Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent
dirichlet allocation. Journal of Machine Learning Re-
search, 3(Jan):993–1022.
Chen, L., Ban, T., Wang, X., Lyu, D., and Chen, H. (2023).
Mitigating prior errors in causal structure learning:
Towards llm driven prior knowledge. arXiv preprint
arXiv:2306.07032.
Dieng, A. B., Ruiz, F. J., and Blei, D. M. (2020). Topic
modeling in embedding spaces. Transactions of the
Association for Computational Linguistics, 8:439–
453.
Gao, H., Yao, C., Li, J., Si, L., Jin, Y., Wu, F., and
Liu, H. (2024). Rethinking causal relationships learn-
ing in graph neural networks. In Proceedings of
the AAAI Conference on Artificial Intelligence, vol-
ume 38, pages 12145–12154.
Grootendorst, M. (2022). Bertopic: Neural topic model-
ing with a class-based tf-idf procedure. arXiv preprint
arXiv:2203.05794.
Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and
Courville, A. C. (2017). Improved training of wasser-
stein gans. In Advances in Neural Information Pro-
cessing Systems, volume 30.
Kaushik, D., Hovy, E., and Lipton, Z. C. (2019). Learn-
ing the difference that makes a difference with
counterfactually-augmented data. arXiv preprint
arXiv:1909.12434.
Kocaoglu, M., Snyder, C., Dimakis, A. G., and Vishwanath,
S. (2017). Causalgan: Learning causal implicit gener-
ative models with adversarial training. arXiv preprint
arXiv:1709.02023.
Lagemann, K., Lagemann, C., Taschler, B., and Mukherjee,
S. (2023). Deep learning of causal structures in high
dimensions under data limitations. Nature Machine
Intelligence, 5(11):1306–1316.
Liu, Z., Grau-Bove, J., and Orr, S. A. (2022). Bert-flow-
vae: a weakly-supervised model for multi-label text
classification. arXiv preprint arXiv:2210.15225.
Miao, Y., Yu, L., and Blunsom, P. (2016). Neural varia-
tional inference for text processing. In International
Conference on Machine Learning, pages 1727–1736.
PMLR.
Morstatter, F. and Liu, H. (2018). In search of coherence
and consensus: measuring the interpretability of sta-
tistical topics. Journal of Machine Learning Research,
18(169):1–32.
Panwar, M., Shailabh, S., Aggarwal, M., and Krishna-
murthy, B. (2020). Tan-ntm: Topic attention net-
works for neural topic modeling. arXiv preprint
arXiv:2012.01524.
Park, S. and Kim, J. (2023). Dag-gcn: directed acyclic
causal graph discovery from real world data using
graph convolutional networks. In 2023 IEEE Interna-
tional Conference on Big Data and Smart Computing
(BigComp), pages 318–319. IEEE.
Pawlowski, N., de Castro, D. C., and Glocker, B. (2020).
Deep structural causal models for tractable counter-
factual inference. In Advances in Neural Information
Processing Systems, volume 33, pages 857–869.
Prostmaier, B., Vávra, J., Grün, B., and Hofmarcher, P. (2025). Seeded poisson factorization: Leveraging domain knowledge to fit topic models. arXiv preprint arXiv:2503.02741.
Rana, M., Hacioglu, K., Gopalan, S., and Boothalingam,
M. (2024). Zero-shot slot filling in the age of llms for
dialogue systems. arXiv preprint arXiv:2411.18980.
Shen, D., Qin, C., Wang, C., Dong, Z., Zhu, H., and
Xiong, H. (2021). Topic modeling revisited: A doc-
ument graph-based neural network perspective. In
Advances in Neural Information Processing Systems,
volume 34, pages 14681–14693.
Srivastava, A. and Sutton, C. (2017). Autoencoding vari-
ational inference for topic models. arXiv preprint
arXiv:1703.01488.
Tang, Y. K., Huang, H., Shi, X., and Mao, X. L. (2024). Be-
yond labels and topics: Discovering causal relation-
ships in neural topic modeling. In Proceedings of the
ACM Web Conference 2024, pages 4460–4469.
Venugopalan, M. and Gupta, D. (2022). An enhanced
guided lda model augmented with bert based semantic
strength for aspect term extraction in sentiment analy-
sis. Knowledge-based Systems, 246:108668.
Wang, B., Li, J., Chang, H., Zhang, K., and Tsung, F.
(2025). Heterophilic graph neural networks optimiza-
tion with causal message-passing. In Proceedings of
the Eighteenth ACM International Conference on Web
Search and Data Mining, pages 829–837.
Wu, Y., McConnell, L., and Iriondo, C. (2024). Counterfac-
tual generative modeling with variational causal infer-
ence. arXiv preprint arXiv:2410.12730.
Yang, M., Liu, F., Chen, Z., Shen, X., Hao, J., and Wang, J.
(2021). Causalvae: Disentangled representation learn-
ing via neural structural causal models. In Proceed-
ings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition, pages 9593–9602.
Yang, Y., Nafea, M. S., Ghassami, A., and Kiyavash, N.
(2022). Causal discovery in linear structural causal
models with deterministic relations. In Conference
on Causal Learning and Reasoning, pages 944–993.
PMLR.
Yu, Y., Chen, J., Gao, T., and Yu, M. (2019). Dag-gnn: Dag
structure learning with graph neural networks. In In-
ternational Conference on Machine Learning, pages
7154–7163. PMLR.
Zečević, M., Dhami, D. S., Veličković, P., and Kersting, K. (2021). Relating graph neural networks to structural causal models. arXiv preprint arXiv:2109.04173.
Zheng, X., Aragam, B., Ravikumar, P. K., and Xing, E. P.
(2018). Dags with no tears: Continuous optimization
for structure learning. In Advances in Neural Informa-
tion Processing Systems, volume 31.
Zhu, B., Cai, Y., and Ren, H. (2023). Graph neural topic
model with commonsense knowledge. Information
Processing & Management, 60(2):103215.
Zhu, Q., Feng, Z., and Li, X. (2018). Graphbtm: Graph en-
hanced autoencoded variational inference for biterm
topic model. In Proceedings of the 2018 Conference
on Empirical Methods in Natural Language Process-
ing, pages 4663–4672.