CGNTM: Unsupervised Causal Topic Modeling with LLMs and
Nonlinear Causal GNNs
Peixuan Men (https://orcid.org/0009-0002-2630-3838), Longchao Wang (https://orcid.org/0009-0009-1387-3517),
Aihua Li (https://orcid.org/0000-0001-6742-3268) and Xiaoli Tang (https://orcid.org/0000-0001-6946-3482)
Institute of Medical Information/Medical Library, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
Keywords:
Causal Topic Modeling, Unsupervised Learning, Neural Causal Modeling, Graph Neural Networks.
Abstract:
We propose CGNTM, a fully unsupervised causal topic model that integrates large language models (LLMs)
with neural causal inference. Unlike conventional and supervised topic models, CGNTM learns both hierar-
chical topics and their directed causal relations directly from raw text, without requiring labeled data. The
framework leverages LLM-based prompt extraction to identify salient keywords and candidate causal pairs,
which are refined through differentiable Directed Acyclic Graph (DAG) learning and modeled via a nonlinear
structural causal model (SCM). A directionally masked graph neural network (GNN) propagates information
strictly along causal edges, while a Wasserstein Generative Adversarial Network (GAN) enforces semantic
consistency under counterfactual interventions via BERT-based regularization. This combination enables the
model to not only discover coherent and diverse topics but also uncover interpretable causal relationships
among them. The architecture supports hierarchical topic organization by clustering fine-grained terms into
broader themes and modeling cross-level dependencies through dual-layer message passing. Experimental re-
sults demonstrate that CGNTM outperforms state-of-the-art models in topic quality and causal interpretability.
Ablation studies confirm the essential role of each component (LLM-guided extraction, nonlinear SCM, directional GNN propagation, and adversarial training) in contributing to both causal accuracy and topic coherence.
The proposed framework opens new directions for unsupervised causal discovery in text, offering transforma-
tive potential in domains where understanding why certain topics co-occur is as crucial as identifying what
they are.
1 INTRODUCTION
Topic modeling is a vital tool in natural language
processing for uncovering hidden themes in large
text corpora. Classical models like Latent Dirich-
let Allocation (LDA) summarize documents into in-
terpretable topics, supporting tasks such as classifi-
cation and retrieval, but rely on bag-of-words, as-
sume independence, and ignore semantic dependen-
cies, limiting interpretability and omitting concept
relationships (Morstatter and Liu, 2018). Recent
Neural Topic Models (NTMs) leverage deep genera-
tive networks for flexible inference, enhancing coher-
ence through contextualized embeddings or external
knowledge (Shen et al., 2021). However, they capture
only statistical co-occurrence, not causal relationships
among topics, hindering interpretability and the abil-
ity to answer “why” questions from text data.
Recent efforts integrate causality, such as the
supervised Causal Relationship-Aware Neural Topic
Model (CRNTM) (Tang et al., 2024), which uses
Structural Causal Models (SCMs) to uncover topic-
label links in a Directed Acyclic Graph (DAG). This
improves structure and quality but requires supervi-
sion. Discovering causal relations in unlabeled cor-
pora, particularly with hierarchical organization, re-
mains an open challenge (Lagemann et al., 2023).
This paper addresses unsupervised causal topic
discovery: identifying hierarchical topics and infer-
ring a DAG of their causal relationships from raw
text without supervision. This tackles the intertwined
challenges of multi-granularity topic extraction and
causal graph inference using statistical patterns and
semantic knowledge for interpretable structures.
We propose the Causal Graph Neural Topic Model
(CGNTM), integrating LLMs, causal graph learning,
and a GNN in an unsupervised pipeline. It extracts
keywords and causal pairs via LLMs, refines them
through NOTEARS (Zheng et al., 2018) to enforce
DAG constraints, models nonlinear causality with a
neural SCM and directional GNN, and refines results
via a WGAN with BERT-based consistency for coun-
terfactual generation, enhancing topic coherence and
interpretability.
In summary, this paper introduces CGNTM as a
novel unsupervised framework for causal topic dis-
covery. Its key contributions are as follows:
• The first unsupervised topic modeling approach integrating LLMs to extract causal relationships and construct a causal topic graph from unlabeled corpora.
• A novel architecture combining a nonlinear neural SCM with a masked directional GNN for cause-to-effect propagation.
• Hierarchical causal structure via a two-layer GNN and clustering for multi-granular interpretation.
• Adversarial counterfactual training with WGAN and BERT consistency to improve coherence and prevent spurious links.
2 RELATED WORK
2.1 Neural and Hierarchical Topic
Modeling Approaches
Topic modeling has evolved from classical proba-
bilistic models to deep learning-based methods. Tra-
ditional approaches like Latent Dirichlet Allocation
(LDA) (Blei et al., 2003) and its hierarchical exten-
sions (Miao et al., 2016) infer topics as latent vari-
ables under assumptions such as word exchangeabil-
ity and topic independence, which limit expressive-
ness and coherence.
NTMs address these by incorporating variational
autoencoders or transformers for flexible representa-
tions. Key examples include ProdLDA (Srivastava
and Sutton, 2017), Embedded Topic Model (ETM)
(Dieng et al., 2020), and contextualized models
(Venugopalan and Gupta, 2022), which use learned
distributions and semantic embeddings to boost in-
terpretability. Recent advances integrate graphs or
knowledge, such as BERTopic (Grootendorst, 2022)
for embedding-based topics and BERT-Flow-VAE
(Liu et al., 2022) for calibrated embeddings in VAE
modeling.
However, these methods still treat topic relations
as statistical correlations or taxonomic hierarchies
rather than causal dependencies.
2.2 Causal Discovery and Structural
Causal Models in NLP
Causal discovery is gaining traction in NLP, espe-
cially topic modeling. SCMs formalize relations via functional equations of the form $X_i = f_i(PA_i, N_i)$, where $PA_i$ are the parents of $X_i$ in a DAG (Yang et al., 2022; Pawlowski et al., 2020), though unsupervised text applications are limited.
Works like CRNTM incorporate SCMs to link latent topics and labels, but require supervision.
CausalVAE adds a causal layer in VAEs for latent
DAGs without strong supervision (Panwar et al.,
2020; Yang et al., 2021), yet focuses on disentan-
gled factors, not textual topics (Wu et al., 2024). Few
explore fully unsupervised causal discovery in NLP
(Prostmaier et al., 2025), often relying on labels or
disentanglement rather than interpretable themes.
2.3 GNNs for Causal Inference and
Topic Modeling
GNNs model relational patterns effectively. In causal
discovery, DAG-GNN combines GNNs with VAEs
for DAG learning from data, optimizing acyclic-
ity constraints (Yu et al., 2019; Park and Kim,
2023). For topic modeling, GNNs propagate in-
formation via graphs, as in GNTM (Shen et al.,
2021) and GraphBTM (Zhu et al., 2018) using co-
occurrence statistics. However, integrating GNN-driven causal learning with topics remains underexplored (Gao et al., 2024; Behnam and Wang, 2024); existing methods treat GNNs as auxiliaries rather than embedding them in causal pipelines, and no work jointly learns topics and causal structures end-to-end without supervision.
2.4 Large Language Models for
Zero-Shot Keyword and Relation
Extraction
LLMs demonstrate strong zero-shot capabilities in
extracting keywords and semantic relations from text
(Rana et al., 2024; Chen et al., 2023). Using natural-
language prompts, LLMs can identify salient terms
and hypothesize causal connections without task-
specific training, making them valuable for uncov-
ering candidate causal pairs in unlabeled corpora.
Recent works leverage LLMs for weakly supervised
causal discovery by querying plausible relations or
generating pseudo-labels. However, standalone LLM
extraction lacks consistency across documents and
fails to yield a structured topic model.
These limitations underscore the need for
CGNTM, which integrates LLM-driven knowledge
extraction with neural causal inference. Unlike
conventional models such as BERTopic, which lack
causal interpretation, or CRNTM, which requires
supervision, CGNTM enables fully unsupervised
causal topic discovery. It combines zero-shot prompt-
based extraction with a nonlinear GNN-based causal
inference module, facilitating interpretable topic
modeling and causal structure learning without
labeled data. This positions CGNTM as a novel
contribution at the intersection of NLP, causality, and
deep learning.
3 METHODS
We propose a multi-stage framework (Figure 1) that
combines LLM-based semantic parsing with neural
causal graph modeling to uncover latent topics and
their causal relations. The pipeline proceeds as fol-
lows: (1) extract candidate keywords and causal pairs
from text using an LLM; (2) construct a global di-
rected causal graph over these candidates and refine
it via NOTEARS continuous optimization to enforce
acyclicity; (3) instantiate a SCM on this graph with
neural functional mechanisms; (4) perform directed
message passing with a GNN that masks reverse (anti-
causal) edges; (5) extend the GNN to a dual-layer
architecture for hierarchical topic representation; (6)
generate counterfactual textual data by intervening on
the learned causal topic variables and using a condi-
tional WGAN with a BERT-based consistency mod-
ule; and (7) train all components under a joint objec-
tive. The subsequent sections elaborate on each stage
in detail.
Figure 1: The overall multi-stage framework of CGNTM.
As illustrated in Figure 1, our framework seam-
lessly integrates symbolic knowledge extraction with
neural causal modeling. The following subsections
detail each component’s technical implementation
and theoretical foundations.
3.1 Keyword and Causal Pair
Extraction via LLMs
The first stage extracts candidate topic variables (key-
words) and causal links from an unlabeled corpus us-
ing a large language model (LLM). For each docu-
ment, the LLM is prompted to output: (a) salient key-
words summarizing its content, and (b) any explicit
cause-effect pairs (e.g., “X causes Y”) mentioned in
the text. Prior studies show LLMs can identify high-
quality keyphrases and infer causal relations in a zero-
shot manner. Building on this, we aggregate each doc-
ument’s extracted keywords and directed pairs into a
global set of topic terms and candidate edges. Each
edge may carry a confidence score based on frequency
or LLM-reported likelihood. This unsupervised pars-
ing injects semantic knowledge into the pipeline, pro-
viding a plausible initial causal graph structure.
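To make this extraction step concrete, the following Python sketch shows one way the per-document prompting and aggregation described above could be implemented; the prompt wording, the query_llm helper, and the JSON output format are illustrative assumptions rather than the exact prompts used in this work.

import json
from collections import Counter

def query_llm(prompt: str) -> str:
    # Hypothetical helper: wrap whatever LLM client is available and return its raw reply.
    raise NotImplementedError("plug in an LLM API client here")

PROMPT = (
    "Read the document below.\n"
    "1. List up to 10 salient keywords.\n"
    "2. List any explicit cause-effect pairs mentioned in the text as [cause, effect].\n"
    'Answer as JSON: {{"keywords": [...], "causal_pairs": [["cause", "effect"], ...]}}\n\n'
    "Document:\n{doc}"
)

def extract_candidates(corpus):
    """Aggregate per-document LLM outputs into a global term set and frequency-weighted edges."""
    terms, edges = Counter(), Counter()
    for doc in corpus:
        parsed = json.loads(query_llm(PROMPT.format(doc=doc)))
        terms.update(w.lower() for w in parsed["keywords"])
        edges.update((c.lower(), e.lower()) for c, e in parsed["causal_pairs"])
    return terms, edges  # edge counts act as simple confidence scores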
3.2 Global Causal Graph Construction
and NOTEARS Optimization
We next consolidate the LLM-derived keywords and
causal pairs into a unified global causal topic graph.
This graph is represented by an adjacency matrix,
where each entry indicates a directed edge between
topics. The initial structure is constructed from
the candidate edges identified in Section 3.1, with
weights set either uniformly or based on LLM-
provided confidence scores. Since these raw connec-
tions may contain cycles or noisy links, we refine the
graph by solving a structure learning problem with an
acyclicity constraint.
To this end, we use the NOTEARS method, which
casts DAG learning as a smooth optimization prob-
lem. It learns a weighted adjacency matrix that min-
imizes a given loss while ensuring the graph remains
acyclic through a differentiable constraint based on
the matrix exponential formulation. Specifically, we
impose:
h(A) = \mathrm{Tr}\,[\exp(A \circ A)] - m = 0    (1)

where $A \circ A$ denotes the elementwise square and $m = |V|$ is the number of nodes. Intuitively, $\mathrm{Tr}\,[\exp(A \circ A)] = m$ if and only if $A$ has no cyclic dependencies. This hard equality constraint can be handled via a penalty or augmented Lagrangian method during optimization. We thus solve:

\min_{A,\,\Theta} \; \mathcal{L}_{\mathrm{score}}(A, \Theta) \quad \text{s.t.} \quad h(A) = 0    (2)

where $\Theta$ denotes other model parameters (discussed later) and $\mathcal{L}_{\mathrm{score}}$ is a structure score or loss. We initialize $A$ with the LLM-extracted graph and iteratively update it to reduce $\mathcal{L}_{\mathrm{score}}$ while applying the DAG constraint (using gradient-based optimization). This yields a global causal topic graph that best ex-
plains the data without cycles. By casting structure
discovery as continuous optimization, we avoid brute-
force search over graphs and can efficiently handle
the moderate number of topic nodes. The resulting
adjacency A (with learned weights) will serve as the
backbone for our neural causal model.
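As a minimal illustration of Equations (1)-(2), the PyTorch snippet below computes the differentiable acyclicity measure h(A) and a penalized objective of the kind optimized here; the function names and the simple quadratic-penalty form are illustrative assumptions (the augmented Lagrangian schedule is described in Section 3.7).

import torch

def notears_h(A: torch.Tensor) -> torch.Tensor:
    # h(A) = tr(exp(A o A)) - m, which equals zero iff the weighted graph is acyclic.
    m = A.shape[0]
    return torch.trace(torch.linalg.matrix_exp(A * A)) - m

def penalized_objective(score_loss: torch.Tensor, A: torch.Tensor, rho: float = 1.0, alpha: float = 0.0):
    # Quadratic-penalty / augmented-Lagrangian style surrogate for "min L_score s.t. h(A) = 0".
    h = notears_h(A)
    return score_loss + 0.5 * rho * h * h + alpha * h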
3.3 Neural Structural Causal Model
(NSCM)
Given the DAG learned in Section 3.2, we define a structural causal model to describe how each topic variable is generated as a function of its causes. Following the Pearlian framework, an SCM consists of a set of structural equations $X_i = f_i(PA_i, N_i)$ for each node $X_i$ (representing topic $v_i$), where $PA_i$ denotes the set of parent variables (direct causes of $X_i$ in $G$) and $N_i$ is an exogenous noise term (independent for each $i$) capturing unmodeled variation.
In our Neural SCM, we parameterize each structural function as a neural network, allowing for complex non-linear causal relationships between topics. In particular, for each directed edge $v_j \to v_i$ in the graph, the causal influence of topic $v_j$ on topic $v_i$ is modeled by learnable weights within the neural function $f_i$. Formally, let $z_i$ be a latent representation or "activation" of topic $v_i$. We define:

z_i = f_i\big(\{ z_j : v_j \in PA_i \}\big) + n_i    (3)

where $n_i \sim \mathcal{N}(0, \sigma^2)$ (or another suitable noise distribution) and $f_i$ is implemented as a multilayer perceptron (MLP). This yields a deep non-linear structural equation model (SEM) that generalizes traditional linear causal models.
The collection $\{f_i\}_{i=1}^{m}$ together with the adjacency $A$ defines a joint distribution $P(X_1, \ldots, X_m)$ over the topic variables, essentially a causal topic model. Importantly, because $A$ is acyclic, we can sample from this model by ancestral sampling (order the topics topologically and sample each $X_i$ from $f_i(PA_i)$). The Neural SCM grants our model the capacity to capture complex interactions (e.g., a cause may affect an effect in a highly non-linear or context-dependent way) (Zečević et al., 2021). It also enables interventions: we can apply the do-operator $do(X_i = x^{*})$ by clamping $z_i$ to a chosen value and forward-simulating the rest of the model, to infer causal effects or counterfactual topic configurations.

In summary, at this stage we have a parameterized causal model $\mathcal{M} = (A, \{f_i\})$ that explains how topics cause one another. The parameters of $f_i$ (and any distributional parameters of $N_i$) will be learned from data, typically by maximizing the likelihood of observed topic occurrences in documents.
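The sketch below illustrates the kind of per-node MLP parameterization and ancestral sampling (including the do-operator) described in this section; the layer sizes, Gaussian noise, and class interface are illustrative assumptions, not the exact architecture used in our experiments.

import torch
import torch.nn as nn

class NeuralSCM(nn.Module):
    """One MLP f_i per topic node; A[j, i] != 0 encodes a directed edge j -> i."""
    def __init__(self, A: torch.Tensor, hidden: int = 32):
        super().__init__()
        self.A, self.m = A, A.shape[0]
        self.parents = [torch.nonzero(A[:, i]).flatten() for i in range(self.m)]
        self.f = nn.ModuleList([
            nn.Sequential(nn.Linear(max(len(p), 1), hidden), nn.ReLU(), nn.Linear(hidden, 1))
            for p in self.parents
        ])

    @torch.no_grad()
    def sample(self, topo_order, noise_std=0.1, interventions=None):
        """Ancestral sampling along a topological order; `interventions` maps node index -> clamped value."""
        z = torch.zeros(self.m)
        for i in topo_order:
            if interventions and i in interventions:
                z[i] = interventions[i]          # do(X_i = x*): clamp and skip f_i
                continue
            pa = self.parents[i]
            inp = z[pa].unsqueeze(0) if len(pa) > 0 else torch.zeros(1, 1)
            z[i] = self.f[i](inp).squeeze() + noise_std * torch.randn(())
        return z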
3.4 Directionally Masked GNN
Propagation
To leverage the learned causal graph during both
training and inference, we implement a directionally
masked Graph Attention Network (GAT) that propa-
gates information strictly along directed edges (Wang
et al., 2025; Liu et al., 2022). Unlike conventional
GATs that allow bidirectional attention, our model en-
forces causal directionality to align with the causal se-
mantics of the structural causal model (SCM).
We implement a directionally masked GNN on the
directed acyclic graph (DAG) learned in Section 3.2
(Kaushik et al., 2019). The DAG is represented by an adjacency matrix $A$, where $A_{ji}$ denotes the weight of the directed edge from node $j$ to node $i$, with $A_{ji} = 0$ for nonexistent or directionally excluded edges (i.e., edges not present in the causal graph). Each topic node $v_i$ is associated with a hidden state vector $h_i^{(l)}$ at layer $l$ of the GNN (with $h_i^{(0)}$ initialized from node features, e.g., the textual embedding of the keyword or an initial topic score in a document). For each topic node $v_i$, we compute attention weights only over its parent nodes, defined as $PA_i = \{ j \mid A_{ji} > 0 \}$:

\alpha_{ji} = \mathrm{softmax}\big(\mathrm{LeakyReLU}\big(a^{\top} [\, W^{(l)} h_j^{(l)} \,\|\, W^{(l)} h_i^{(l)} \,]\big)\big)    (4)
Then the layer-wise update for node $i$ is defined as:

h_i^{(l+1)} = \sigma\Big( \sum_{j \in PA_i} \alpha_{ji} \, W^{(l)} h_j^{(l)} \Big)    (5)
where $\|$ denotes concatenation, $a$ is a learnable attention vector, $W^{(l)}$ are layer-specific weight matrices, and $\sigma(\cdot)$ is a non-linear activation function. The attention coefficients $\alpha_{ji}$ serve as dynamic, learned weights on edge $j \to i$, capturing causal influence adaptively based on node features.

In matrix form, the propagation can be expressed as:

H^{(l+1)} = \sigma\big( A \cdot (H^{(l)} W^{(l)}) \big)    (6)

where $H^{(l)}$ is the matrix of all node hidden states at layer $l$, and $A$ is the masked adjacency matrix from Section 3.2, ensuring that information propagates only along the causal directions defined by the DAG. By stacking $L$ such layers (where $L$ is the number of GNN layers, typically set to 2-4 based on the graph depth and empirical tuning), each node's representation $h_i^{(L)}$ captures information from its causal ancestors up to $L$ hops away. This directed GNN aligns
naturally with the semantics of the SCM: messages
flow from causes to effects, mirroring how interven-
tions or changes propagate in the causal graph. We
use this GNN both to encode observational data into
latent topic representations and to simulate the spread
of causal influence in the topic graph.
Unlike standard graph convolutions that mix incoming and outgoing messages, our GNN restricts propagation to the incoming edges defined by $PA_i$, implementing causal message passing. While inspired by generative graph models such as DAG-GNN, our architecture enforces strict temporal directionality, ensuring information flows from cause to effect. The directional GNN operates jointly with the neural SCM (Section 3.3) to enable efficient inference: the NSCM defines the generative process for the $z_i$, specifying how topic variables are generated from their causal parents, while the GNN implements the inference process, computing topic representations from observed data while respecting causal constraints. During training, the GNN parameters $\{W^{(l)}, a\}$ are learned jointly with the SCM functions $\{f_i\}$ as part of the unified optimization framework (Section 3.7), ensuring the inference network is consistent with the generative causal model. Given partial topic activations, the GNN infers missing variables via causal propagation. Under interventions, it rapidly computes perturbed topic representations, supporting the counterfactual reasoning that is critical for applications.
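A compact PyTorch sketch of the directionally masked attention layer in Equations (4)-(5) is given below, using a dense adjacency matrix; the masking value, the handling of root nodes, and the ReLU activation are implementation assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedCausalGATLayer(nn.Module):
    """GAT-style layer in which node i attends only to its causal parents {j : A[j, i] > 0}."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)
        self.a = nn.Linear(2 * out_dim, 1, bias=False)    # attention vector a^T [Wh_j || Wh_i]

    def forward(self, H: torch.Tensor, A: torch.Tensor) -> torch.Tensor:
        Wh = self.W(H)                                     # (m, out_dim)
        m = Wh.shape[0]
        # e[i, j]: attention logit for parent j of node i, i.e. edge j -> i.
        pair = torch.cat([Wh.unsqueeze(0).expand(m, m, -1),   # [i, j] -> Wh_j
                          Wh.unsqueeze(1).expand(m, m, -1)],  # [i, j] -> Wh_i
                         dim=-1)
        e = F.leaky_relu(self.a(pair)).squeeze(-1)         # (m, m)
        e = e.masked_fill(A.t() <= 0, -1e9)                # keep only edges with A[j, i] > 0
        alpha = torch.softmax(e, dim=1)                    # normalize over each node's parents
        has_parent = (A.t() > 0).any(dim=1, keepdim=True)
        alpha = torch.where(has_parent, alpha, torch.zeros_like(alpha))  # roots receive no messages
        return torch.relu(alpha @ Wh)                      # Eq. (5) with sigma = ReLU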
3.5 Hierarchical Topic Modeling with
Dual-Layer GNN
Real-world topics often exhibit hierarchical structure,
with fine-grained concepts nested within broader the-
matic categories. To capture this hierarchy, we extend
our single-layer causal topic model to a dual-layer ar-
chitecture that explicitly represents both micro-level
topics and macro-level themes. Our hierarchical ap-
proach consists of three sequential steps: bottom-
up aggregation, horizontal propagation, and top-down
refinement.
We introduce an upper layer of abstract topics
that group together semantically related base top-
ics. Concretely, suppose we partition $V$ into $K$ clusters $\{C_1, \ldots, C_K\}$ (each $C_k$ is a subset of the base topics that constitute a higher-level theme). These clusters could be obtained by heuristic clustering of the keyword embeddings or even by another pass of LLM-based grouping. We then create $K$ new nodes $\{\tilde{v}_1, \ldots, \tilde{v}_K\}$ representing the abstract topics (one per cluster). We connect each base topic node $v_i$ to its abstract parent $\tilde{v}_k$ (if $v_i \in C_k$) via an undirected link, and we also allow directed causal edges among the abstract nodes themselves (induced by the base-level DAG: e.g., if some $v_i \in C_a$ causes $v_j \in C_b$, we add a directed edge $\tilde{v}_a \to \tilde{v}_b$ between abstract topics).
This yields a two-layer graph: Layer 1 comprises the base topics $V$ with directed edges $E$ (from our learned $A$). Layer 2 comprises the abstract topics $\tilde{V}$ with directed edges among them. Additionally, bipartite connections link each $\tilde{v}_k$ to all $v_i \in C_k$.
We then design a two-tier message passing
scheme:
3.5.1 Bottom-up Aggregation
In the bottom-up aggregation step, each abstract topic
node k computes an initial state as the mean of its
member topic embeddings:
\tilde{h}_k^{(0)} = \frac{1}{|C_k|} \sum_{v_i \in C_k} h_i^{(L)}    (7)

using the final embeddings $h_i^{(L)}$ from the base GNN layer as input. This step produces a representation for the higher-level topic as a composition of its subtopics.
3.5.2 Horizontal Propagation at the Abstract
Level
Next, in the horizontal propagation step, we apply another GNN on the abstract topic graph (layer 2) for $L^{\star}$ steps, using the directed edges among $\tilde{V}$. This is analogous to Section 3.4 but on the smaller graph of $K$ nodes. For each abstract topic node $\tilde{v}_k$, we compute attention weights only over its parent nodes:

\alpha_{mk} = \mathrm{softmax}\big(\mathrm{LeakyReLU}\big(a^{\top} [\, W^{(l)} \tilde{h}_m^{(l)} \,\|\, W^{(l)} \tilde{h}_k^{(l)} \,]\big)\big)    (8)

The layer-wise update for node $\tilde{v}_k$ is:

\tilde{h}_k^{(l+1)} = \sigma\Big( \sum_{\tilde{v}_m \in PA(\tilde{v}_k)} \alpha_{mk} \, W^{(l)} \tilde{h}_m^{(l)} \Big), \quad l = 0, \ldots, L^{\star} - 1    (9)

where $PA(\tilde{v}_k)$ denotes the abstract parent nodes of $\tilde{v}_k$ in the abstract DAG and $\alpha_{mk}$ is the attention coefficient between parent $\tilde{v}_m$ and target $\tilde{v}_k$. This yields refined high-level topic embeddings $\tilde{h}_k^{(L^{\star})}$ that capture how broad themes influence each other.
3.5.3 Top-down Refinement
Finally, in the top-down refinement step, the abstract
nodes pass messages back to their children to update
base topic embeddings with global context (e.g., each $v_i$ may receive an additive message from its abstract parent $\tilde{v}_k$). In our implementation, we incorporate this by concatenating the parent's representation to the base node before a final linear transformation.
The resulting model forms a hierarchical topic
structure: base-level nodes represent fine-grained
concepts, while abstract-level nodes capture broader
thematic categories. Directional GNN propagation
ensures semantic consistency across levels. This de-
sign is inspired by hierarchical topic models such as
hLDA, which organize topics using Bayesian priors.
In contrast, our approach employs deterministic clus-
tering and neural message passing to simultaneously
model intra-level causal dependencies and inter-level
abstractions. The dual-layer GNN enhances inter-
pretability through structured topic organization and
improves predictive performance by enabling statisti-
cal sharing among related topics.
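The two helper functions below sketch the cluster-level initialization of Equation (7) and the way abstract edges are induced from the base-level DAG, as described above; the tensor layout and the simple binary edge induction are illustrative assumptions.

import torch

def aggregate_abstract_topics(H_base: torch.Tensor, clusters):
    # Eq. (7): each abstract topic starts as the mean of its member base-topic embeddings.
    return torch.stack([H_base[list(c)].mean(dim=0) for c in clusters])

def induce_abstract_edges(A_base: torch.Tensor, clusters):
    # Add edge C_a -> C_b whenever some base topic in C_a causes a base topic in C_b.
    K = len(clusters)
    A_abs = torch.zeros(K, K)
    for a, Ca in enumerate(clusters):
        for b, Cb in enumerate(clusters):
            if a != b and A_base[list(Ca)][:, list(Cb)].abs().sum() > 0:
                A_abs[a, b] = 1.0
    return A_abs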
3.6 Counterfactual Generation with
WGAN and BERT Consistency
A central motivation of our causal topic model is
to enable counterfactual reasoning—i.e., answering
“what if” questions by generating text under hypo-
thetical interventions. To this end, we leverage the
learned causal graph and SCM to guide a text gen-
eration module that produces counterfactual docu-
ments, while enforcing semantic consistency via a
pretrained language model (BERT). Our approach
adopts a conditional generative adversarial network,
where the generator receives an intervened topic rep-
resentation and outputs synthetic text, and the dis-
criminator distinguishes between real and generated
samples. We employ the Wasserstein GAN variant to
enhance training stability and mode coverage (Gulra-
jani et al., 2017).
3.6.1 Conditioning on Causal Topics
The generator is conditioned on the causal topic vector of a document, which we obtain from the Neural SCM/GNN. For each real document $d$, we first infer its topic activation vector $z = [z_1, \ldots, z_m]$ using the current model; this can be done by feeding $d$ through the GNN encoder or by direct inference in the SCM. We then sample an intervention on $z$: for instance, to generate a counterfactual where topic $v_k$ is altered, we set $z_k$ to a new value $z_k^{*}$ (e.g., zero to simulate removing that topic, or a higher value to simulate emphasizing it) while keeping the other $z_{i \neq k}$ the same, or also updating descendants of $v_k$ via the SCM to reflect causal effects. Denote this intervened topic vector as $z_{cf} = do(v_k = z_k^{*})$.
According to our causal model, $z_{cf}$ represents a coherent counterfactual state of the topics (the distribution that would occur if $v_k$ were set to $z_k^{*}$). The generator $G$ then maps $(z_{cf}, \xi) \mapsto \tilde{x}$, where $\xi$ is random noise and $\tilde{x}$ is a generated text. In practice, $G$ can be implemented as a transformer-based language model or any sequence decoder that accepts a conditioning vector (here $z_{cf}$ may be fed through a projection and used as the initial hidden state or as a prompt). The discriminator $D$ is trained to distinguish a real document $x$ from a generated $\tilde{x}$, while $G$ is trained to fool $D$.
We optimize the standard WGAN objective with gradient penalty,

\min_G \max_D \; \mathbb{E}_{x \sim P_{\text{data}}}[D(x)] - \mathbb{E}_{z, \xi}[D(G(z, \xi))] - \lambda \, \mathbb{E}_{\hat{x}}\big[ (\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1)^2 \big]    (10)

which guides $G$ to produce outputs whose distribution matches the real text distribution (under various interventions on $z$). Here $\hat{x}$ denotes points sampled along the line between real and generated samples for the gradient penalty. Crucially, because $z$ is drawn from our causal model (including interventional cases that may not appear in the training data), the generator learns to produce texts for both observational and counterfactual topic combinations.
This approach is analogous to the CausalGAN
framework (Kocaoglu et al., 2017), where a generator
architecture consistent with a causal graph can output
samples from both the true observational and inter-
ventional distributions. In our case, the SCM pro-
vides z for interventions, and the conditional GAN
learns to map those to realistic text, effectively learn-
ing P (text|topics) for both normal and intervened top-
ics. Prior work has demonstrated the feasibility of us-
ing WGAN to generate data consistent with a given
causal graph, even for interventions not seen in the
training set.
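For reference, the snippet below sketches the WGAN-GP losses in Equation (10) for a critic and generator operating on continuous document representations; it assumes 2-D (batch, feature) inputs and λ = 10, and leaves the text decoder itself abstract.

import torch

def gradient_penalty(D, real, fake, lam=10.0):
    # Penalize deviation of the critic's gradient norm from 1 along real/fake interpolates.
    eps = torch.rand(real.size(0), 1, device=real.device)
    x_hat = (eps * real + (1.0 - eps) * fake).requires_grad_(True)
    grads = torch.autograd.grad(D(x_hat).sum(), x_hat, create_graph=True)[0]
    return lam * ((grads.norm(2, dim=1) - 1.0) ** 2).mean()

def critic_loss(D, real, fake):
    # The critic maximizes E[D(x)] - E[D(G(z, xi))] - GP, i.e. minimizes the negation below.
    return D(fake).mean() - D(real).mean() + gradient_penalty(D, real, fake)

def generator_loss(D, fake):
    return -D(fake).mean()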
3.6.2 BERT-Based Consistency Regularization
While the GAN loss ensures the generated text is realistic, we also want the counterfactual text $\tilde{x}$ to remain maximally similar to the original text $x$ in all aspects except those affected by the intervention. We introduce a consistency module using BERT to enforce this. Let $\mathrm{Enc}_{\mathrm{BERT}}(x)$ be the contextual embedding of the original text, and likewise $\mathrm{Enc}_{\mathrm{BERT}}(\tilde{x})$ for the generated text. We add a penalty term:

\mathcal{L}_{\mathrm{cons}} = \big\lVert \mathrm{Enc}_{\mathrm{BERT}}(x) - \mathrm{Enc}_{\mathrm{BERT}}(\tilde{x}) \big\rVert    (11)
which encourages the generated text to lie close
to the original in semantic embedding space. In prac-
tice, we compute this as the cosine distance between the BERT embeddings of $x$ and $\tilde{x}$, or as a weighted token-level similarity loss. The idea is to preserve the document's main content, modifying only the topical details related to the intervened variables. The BERT consistency loss guides $G$ to make minimal but precise edits. We also explicitly verify that $G$ has indeed effected the desired change (e.g., by checking that keywords for topic $v_k$ are reduced or removed in $\tilde{x}$).
In summary, our counterfactual generation mod-
ule produces alternative versions of documents by
toggling causal factors, using a WGAN to maintain
fluency and realism and a BERT-based regularizer to
maintain fidelity to the source content.
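A possible form of the consistency term, computed here as a cosine distance between [CLS] embeddings with the Hugging Face transformers library, is sketched below; it is written as an evaluation-time check, since back-propagating into the generator additionally requires a differentiable (e.g., soft-embedding) representation of the generated text.

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()

@torch.no_grad()
def bert_consistency(original: str, counterfactual: str) -> torch.Tensor:
    # Cosine distance between [CLS] embeddings of the original and counterfactual documents.
    def embed(text):
        inputs = tokenizer(text, return_tensors="pt", truncation=True)
        return bert(**inputs).last_hidden_state[:, 0]    # [CLS] token embedding
    return 1.0 - torch.cosine_similarity(embed(original), embed(counterfactual), dim=-1).mean()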
3.7 Joint Objective and Training
Strategy
All components of our model are trained jointly to
ensure that causal structure learning, topic modeling,
and text generation inform each other. We formu-
late a multi-term loss that combines the objectives
of the structural modules and the generative modules.
Specifically, our overall loss $\mathcal{L}_{\mathrm{total}}$ includes:
• A structure loss $\mathcal{L}_{\mathrm{struct}}$ for fitting the causal graph to data (for example, the negative log-likelihood of the observed topic occurrences under the Neural SCM, or an evidence lower bound as in a VAE), plus any L1/L2 regularization on $A$ to encourage sparse, interpretable graphs.
• The DAG constraint penalty $\lambda_{\mathrm{DAG}} |h(A)|$ to enforce acyclicity (or an augmented Lagrangian term as in NOTEARS).
• The GAN loss terms for text generation (the generator and discriminator Wasserstein losses, denoted $\mathcal{L}_G$ and $\mathcal{L}_D$).
• The BERT consistency loss $\mathcal{L}_{\mathrm{cons}}$.
We weight these components with hyperparameters to balance their influence:

\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{struct}} + \lambda_{\mathrm{DAG}}\, h(A) + \lambda_G \mathcal{L}_G + \lambda_D \mathcal{L}_D + \lambda_{\mathrm{cons}} \mathcal{L}_{\mathrm{cons}}    (12)
Our training strategy is a hybrid of alternating op-
timization and stage-wise training. Inspired by two-
stage approaches in causal generative modeling, we
first train the structure and SCM on observational data
alone, then train the GAN generator and discriminator
on text generation given the learned topics, and finally
fine-tune all components together.
In the first stage, we optimize $\mathcal{L}_{\mathrm{struct}}$ with the DAG constraint (using the augmented Lagrangian or a penalty method) to learn $A$ and the neural parameters of the $f_i$ (as well as to learn good initial GNN weights for encoding topics). This may be done via variational inference, e.g., treating the topic activations $z_i$ in each document as latent variables and maximizing an Evidence Lower Bound (ELBO), or simply by treating inferred topic vectors (from an unsupervised topic model or the LLM extraction frequencies) as training data for a regression model defined by the SCM.
Once a reasonable causal graph and SCM are
learned, we proceed to train the GAN. We gener-
ate training pairs $(z_{cf}, x)$ by taking real documents $x$, inferring their topic vector $z$, randomly sampling interventions on $z$ (to get $z_{cf}$), and using $x$ as the "real" example associated with the original $z$, versus $G(z_{cf})$ as a "fake" example for the intervened case. The discriminator $D$ learns to judge realism, while $G$ learns to produce plausible text for both actual and hypothetical topic conditions. We incorporate the consistency loss with the original text $x$ during $G$'s updates to ensure counterfactuals remain anchored to $x$.
In the final joint stage, we allow gradients from
the text generation loss to also update the earlier
modules (SCM and GNN), which can further re-
fine the topic representations to better support flu-
ent generation. In practice, we alternate between an
epoch of structure+SCM/GNN updates (minimizing $\mathcal{L}_{\mathrm{struct}} + \lambda_{\mathrm{DAG}}\, h(A)$ while keeping the GAN fixed), and an epoch of GAN updates (minimizing $\mathcal{L}_G + \lambda_{\mathrm{cons}} \mathcal{L}_{\mathrm{cons}}$ and maximizing $\mathcal{L}_D$ with $A$ and the SCM fixed). This alternating schedule is similar in spirit to expectation-maximization or the two-phase training of CausalGAN: first learn to model the latent causal factors, then learn to generate observable data from those factors.
Finally, we fine-tune end-to-end on the combined objective $\mathcal{L}_{\mathrm{total}}$ (with a small learning rate) to ensure all parts are mutually consistent. Throughout training, we monitor the DAG constraint and gradually increase the penalty coefficient $\lambda_{\mathrm{DAG}}$ to drive $h(A) \to 0$, ensuring the learned graph remains acyclic.
4 EXPERIMENTS AND RESULTS
4.1 Experimental Setup
4.1.1 Dataset and Preprocessing
We train and evaluate CGNTM on the PubMed Lung
Cancer Corpus, consisting of approximately 20,000
English-language articles (titles and abstracts) pub-
lished over the past two decades. The dataset was
built by querying PubMed with domain-specific key-
words such as “lung cancer” and “non-small cell car-
cinoma”.
For structured causal inputs, we use an LLM
with prompt engineering to extract salient keywords
from each document. Causal relationships are then inferred from contextual patterns, producing triples of the form (cause, relation, effect) that are used to construct document-level causal graphs for the CGNTM pipeline. The corpus is split
into 80% training and 20% testing, with 10% of train-
ing held out for hyperparameter tuning.
4.1.2 Evaluation Metrics
We evaluate CGNTM on topic quality and causal cor-
rectness using five metrics.
Topic Coherence: Normalized Pointwise Mu-
tual Information (NPMI) measures the average nor-
malized pointwise mutual information among the top
words within each topic. For topic $t$, let $W_t$ denote its top-$k$ words. The NPMI score is computed as:

\mathrm{NPMI}(t) = \frac{2}{k(k-1)} \sum_{1 \le i < j \le k} \frac{\log \frac{P(w_i, w_j)}{P(w_i) P(w_j)}}{-\log P(w_i, w_j)}    (13)

where $P(w_i)$ and $P(w_i, w_j)$ are estimated from corpus-wide co-occurrence counts. Higher NPMI indicates stronger semantic coherence and aligns better with human judgment.
Topic Diversity (TD): Measures the proportion of
unique words across all topics. Let $T$ denote the set of all topics and $V_{\mathrm{top}}$ the set of all top-$k$ words across topics. Then:

\mathrm{TD} = \frac{\big| \bigcup_{t \in T} W_t \big|}{k \cdot |T|}    (14)

Higher TD implies broader topic coverage and lower redundancy.
Higher TD implies broader topic coverage and
lower redundancy.
Causal Precision (CP): The proportion of in-
ferred causal edges $(i \to j) \in E$ that match a curated biomedical causal knowledge base $E^{*}$:

\mathrm{CP} = \frac{|E \cap E^{*}|}{|E|}    (15)
The CP metric is a standard precision measure in
causal discovery literature, directly adapted from the
predictive model evaluation metric “Precision” to as-
sess the accuracy of inferred causal edges. Higher
CP indicates better alignment with known causal re-
lations.
Reverse Causality Rate (RCR): Measures the
fraction of inferred causal edges that contradict
known causal directionality:
\mathrm{RCR} = \frac{\big| \{ (i \to j) \in E \mid (j \to i) \in E^{*} \} \big|}{|E|}    (16)
A lower RCR suggests more accurate directional
inference.
Counterfactual Semantic Alignment (CSA):
Assesses whether interventions modify only the tar-
geted causal components while preserving unrelated
semantics. Let $x$ be a document, $\tilde{x}$ its counterfactual version, and $\varphi(\cdot)$ the [CLS] embedding from BERT. Then:

\mathrm{CSA}(x, \tilde{x}) = \cos\big( \varphi(x), \varphi(\tilde{x}) \big)    (17)
where cosine similarity is used to measure seman-
tic alignment. Higher CSA reflects more precise and
faithful counterfactual generation.
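The topic-diversity and causal-edge metrics above reduce to simple set operations; the short reference implementations below follow the definitions directly, with reference_edges standing in for the curated knowledge base E*.

def topic_diversity(topics):
    # TD = |unique top-k words across topics| / (k * |T|); `topics` is a list of top-k word lists.
    k = len(topics[0])
    unique_words = {w for topic in topics for w in topic}
    return len(unique_words) / (k * len(topics))

def causal_precision(inferred_edges, reference_edges):
    # CP = fraction of inferred directed edges (i, j) that appear in the reference set.
    inferred = set(inferred_edges)
    return len(inferred & set(reference_edges)) / max(len(inferred), 1)

def reverse_causality_rate(inferred_edges, reference_edges):
    # RCR = fraction of inferred edges whose reverse direction appears in the reference set.
    inferred, ref = set(inferred_edges), set(reference_edges)
    return sum((j, i) in ref for (i, j) in inferred) / max(len(inferred), 1)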
4.1.3 Implementation Details
CGNTM is implemented in PyTorch, using a 2-layer
GAT with directional masking for causal graph prop-
agation and a 3-layer MLP for nonlinear SCM de-
pendencies. Hierarchical modeling aggregates base-
level topic embeddings into abstract topics. Coun-
terfactual generation employs a WGAN with gra-
dient penalty (λ = 10) and BERT-based regulariza-
tion. Optimization uses Adam (learning rate 0.001,
batch size 16) for up to 100 epochs with early stop-
ping. Experiments run on a GPU, and hyperparameters are tuned on a validation set. Source code is available
at https://github.com/Longcchao-Wang/Causal-Topic
for reproducibility.
4.2 Quantitative and Qualitative
Results
4.2.1 Comparison with Baselines
We evaluate CGNTM against six representative base-
lines spanning classical, neural, graph-based, and
causal paradigms, focusing on models that support
probabilistic latent factor modeling for unified topic
discovery and causal inference. Recent semantic
clustering approaches like BERTopic and Top2Vec
achieve strong coherence via pre-trained embeddings
but lack latent variables essential for causal structure
learning, hence their exclusion. The baselines are:
• CRNTM (Tang et al., 2024): A supervised causal model learning relations among latent topics and labels via structural equation modeling over a DAG; supervision is simulated with synthetic biomedical risk factor labels for fair comparison.
• LDA (Blei et al., 2003): A classical probabilistic model with Dirichlet priors on document-topic and topic-word distributions.
• ETM (Dieng et al., 2020): A neural topic model projecting words into continuous latent spaces to improve topic coherence.
• NVDM (Miao et al., 2016): A variational autoencoder encoding documents as latent vectors from bag-of-words input.
• GNTM (Shen et al., 2021): A graph-based neural model using document-level word co-occurrence graphs and GNNs.
• GNTM-CK (Zhu et al., 2023): GNTM extended with ConceptNet commonsense knowledge.
These facilitate comparisons across supervised
vs. unsupervised, causal vs. correlational, and
knowledge-enhanced vs. data-driven paradigms. Ta-
ble 1 summarizes performance across NPMI, TD, CP,
RCR, and CSA, showing CGNTM outperforms all in
coherence, diversity, and causal alignment.
For topic quality, CGNTM achieves the highest
NPMI (0.30), surpassing supervised CRNTM (0.29)
and unsupervised baselines like GNTM-CK (0.26)
and ETM (0.24), indicating superior coherence from
biomedical priors and causal constraints. It also
leads in TD (0.82), reflecting broader coverage with
minimal redundancy via hierarchical modeling and
synonym-aware clustering.
In causal accuracy, only CRNTM and CGNTM produce explicit causal graphs; CGNTM's CP (0.70) matches supervised CRNTM (0.69), uncovering meaningful biomedical causalities without supervision, while its low RCR (0.10) confirms reliable directionality, close to CRNTM's 0.07. RCR is not applicable to the non-causal baselines, which do not output directed edges.
For CSA, CGNTM’s 0.88 surpasses all base-
lines, ensuring counterfactuals remain semantically
consistent except for targeted interventions, unlike
CRNTM’s 0.80 due to lacking explicit counterfactual
training. Other models lack intervention support, ren-
dering CSA inapplicable.
Table 1: Comparison of CGNTM with baseline models.
Model NPMI TD CP RCR CSA
LDA 0.18 0.81 0.51 N/A N/A
NVDM 0.22 0.72 0.53 N/A N/A
ETM 0.24 0.73 0.56 N/A N/A
GNTM 0.25 0.76 0.58 N/A N/A
GNTM-CK 0.26 0.77 0.63 N/A N/A
CRNTM 0.29 0.80 0.69 0.07 0.80
CGNTM (ours) 0.30 0.82 0.70 0.10 0.88
4.2.2 Ablation Study
To assess the contribution of individual components
within CGNTM, we conduct an ablation study with
four modified variants (denoted as “w/o” for “with-
out”): w/o LLM Extraction, w/o Neural SCM, w/o
WGAN + Consistency, and w/o Hierarchy, as sum-
marized in Table 2.
a) w/o LLM Extraction: Replaces LLM-based
keyword and causal triple extraction with co-
occurrence-based graphs (e.g., PMI edges). Perfor-
mance drops in CP and CSA validate the importance
of knowledge-guided structure.
b) w/o Neural SCM: Replaces the nonlinear Struc-
tural Causal Model with a linear or identity mapping,
disabling deep causal propagation. NPMI and CP de-
cline, highlighting the benefit of modeling nonlinear
causal effects.
c) w/o WGAN + Consistency: Removes the coun-
terfactual generation and semantic consistency loss.
While core topic metrics remain stable, CSA signifi-
cantly drops, confirming the WGAN’s role in ensur-
ing targeted and semantically aligned interventions.
d) w/o Hierarchy: Flattens the topic structure
by removing macro-micro topic separation. TD de-
creases due to more redundancy, and NPMI also
slightly declines, suggesting that hierarchical model-
ing improves topic specialization.
Table 2: Ablation results for CGNTM.
Model NPMI TD CP RCR CSA
Full CGNTM 0.30 0.82 0.70 0.10 0.88
(–) LLM Extraction 0.27 0.80 0.61 0.15 0.79
(–) Neural SCM 0.28 0.81 0.64 0.13 0.83
(–) WGAN + Consistency 0.29 0.81 0.67 0.11 0.76
(–) Hierarchical Structure 0.28 0.76 0.66 0.12 0.84
4.3 Hyperparameter Sensitivity
We evaluate the robustness of CGNTM with respect
to two key hyperparameters: the number of topics (K)
and the knowledge weight (λ), which controls the in-
fluence of the concept graph.
Number of Topics (K): We varied K from 20 to
100 and observed its impact on NPMI and CSA. Topic
coherence (NPMI) improves as K increases, peaking
around K = 50, then plateaus or slightly declines as
topics become too fine-grained (e.g., 0.30 at K = 50
vs. 0.29 at K = 100). Topic diversity grows with K,
but with diminishing returns after 50. In terms of
causal metrics, CSA peaks in the range of K = 50-
60, balancing coherence and coverage. Too few top-
ics (K = 20) yield broad, less specific topics (CSA around 45%), while too many (K = 100) introduce redundancy and fragment topic quality.
Knowledge Weight (λ): We tested λ in the range
[0, 1.0]. At λ = 0 (no concept supervision), CP and
CSA drop substantially, as expected. Increasing λ to
0.5 steadily improves causal metrics, with CSA rising
from 45% to 58%. However, too high a weight (λ = 1.0) slightly reduces coherence (NPMI around 0.285), as the model may overfit to concept connections. We
found λ = 0.5–0.7 provides the best trade-off, and
used λ = 0.6 as default.
Overall, CGNTM shows stable performance
across a wide hyperparameter range. We recommend
K ≈ 50 and λ in [0.5, 0.7] for corpora of similar size
and domain complexity. These results confirm that
CGNTM’s gains are not contingent on narrow hyper-
parameter settings, but stem from the model’s design.
4.4 Summary and Discussion
The results underscore CGNTM’s strengths as the
first unsupervised topic model integrating LLM-based
extraction with neural causal modeling to uncover in-
terpretable causal relations. Relying solely on un-
labeled data, CGNTM matches supervised CRNTM
in performance while discovering novel causali-
ties beyond label structures and organizing topics
hierarchically—a capability CRNTM lacks.
Compared to unsupervised models like NVDM,
CGNTM excels in coherence and diversity, construct-
ing directed graphs for causal reasoning. For exam-
ple, it infers directionality (e.g., EGFR mutation → drug resistance) where traditional models merely co-
cluster terms, enabling counterfactual simulation and
hypothesis generation.
Relative to NVDM, CGNTM enhances quality
via causal regularization, avoiding posterior collapse.
Unlike BERTopic’s embedding clustering without
causality, CGNTM’s generative framework (SCM, di-
rectional GNN, causal priors) captures both semantic
and causal structures.
In summary, CGNTM bridges contextual topic
modeling and causal discovery, advancing unsuper-
vised methods with descriptive and explanatory in-
sights aligned to domain knowledge.
5 CONCLUSION
We introduce CGNTM, the first unsupervised causal
topic model merging LLM knowledge extraction with
neural causal inference. It discovers interpretable
topics and domain-reflective causal graphs without
labels, achieving competitive coherence and diver-
sity while enabling counterfactual reasoning through
structured SCM and GNN design. This supports ex-
planatory modeling, inferring relations like “EGFR mutation → drug resistance” from text, which is valuable for biomedicine and the social sciences.
Limitations include dependence on LLM triple
quality (errors impact inference), computational in-
tensity (BERT embeddings, GNN propagation, adver-
sarial training), and causal evaluation challenges from
limited ground-truth.
Future work involves multilingual/cross-domain
applications, semi-supervised signals (e.g., seed
causal edges), and structured knowledge bases for
graph constraints.
Ultimately, CGNTM advances topic modeling by
embedding causal discovery unsupervised, fostering
automated hypothesis generation beyond “what” to
“why”.
ACKNOWLEDGEMENTS
This research was funded by the Innovation Fund for
Medical Sciences of Chinese Academy of Medical
Sciences grant number 2021-I2M-1-033.
REFERENCES
Behnam, A. and Wang, B. (2024). Graph neural network
causal explanation via neural causal models. In Euro-
pean Conference on Computer Vision, pages 410–427.
Springer Nature Switzerland.
Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent
dirichlet allocation. Journal of Machine Learning Re-
search, 3(Jan):993–1022.
Chen, L., Ban, T., Wang, X., Lyu, D., and Chen, H. (2023).
Mitigating prior errors in causal structure learning:
Towards llm driven prior knowledge. arXiv preprint
arXiv:2306.07032.
Dieng, A. B., Ruiz, F. J., and Blei, D. M. (2020). Topic
modeling in embedding spaces. Transactions of the
Association for Computational Linguistics, 8:439–
453.
Gao, H., Yao, C., Li, J., Si, L., Jin, Y., Wu, F., and
Liu, H. (2024). Rethinking causal relationships learn-
ing in graph neural networks. In Proceedings of
the AAAI Conference on Artificial Intelligence, vol-
ume 38, pages 12145–12154.
Grootendorst, M. (2022). Bertopic: Neural topic model-
ing with a class-based tf-idf procedure. arXiv preprint
arXiv:2203.05794.
Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and
Courville, A. C. (2017). Improved training of wasser-
stein gans. In Advances in Neural Information Pro-
cessing Systems, volume 30.
Kaushik, D., Hovy, E., and Lipton, Z. C. (2019). Learn-
ing the difference that makes a difference with
counterfactually-augmented data. arXiv preprint
arXiv:1909.12434.
Kocaoglu, M., Snyder, C., Dimakis, A. G., and Vishwanath,
S. (2017). Causalgan: Learning causal implicit gener-
ative models with adversarial training. arXiv preprint
arXiv:1709.02023.
Lagemann, K., Lagemann, C., Taschler, B., and Mukherjee,
S. (2023). Deep learning of causal structures in high
dimensions under data limitations. Nature Machine
Intelligence, 5(11):1306–1316.
Liu, Z., Grau-Bove, J., and Orr, S. A. (2022). Bert-flow-
vae: a weakly-supervised model for multi-label text
classification. arXiv preprint arXiv:2210.15225.
Miao, Y., Yu, L., and Blunsom, P. (2016). Neural varia-
tional inference for text processing. In International
Conference on Machine Learning, pages 1727–1736.
PMLR.
Morstatter, F. and Liu, H. (2018). In search of coherence
and consensus: measuring the interpretability of sta-
tistical topics. Journal of Machine Learning Research,
18(169):1–32.
Panwar, M., Shailabh, S., Aggarwal, M., and Krishna-
murthy, B. (2020). Tan-ntm: Topic attention net-
works for neural topic modeling. arXiv preprint
arXiv:2012.01524.
Park, S. and Kim, J. (2023). Dag-gcn: directed acyclic
causal graph discovery from real world data using
graph convolutional networks. In 2023 IEEE Interna-
tional Conference on Big Data and Smart Computing
(BigComp), pages 318–319. IEEE.
Pawlowski, N., de Castro, D. C., and Glocker, B. (2020).
Deep structural causal models for tractable counter-
factual inference. In Advances in Neural Information
Processing Systems, volume 33, pages 857–869.
Prostmaier, B., Vávra, J., Grün, B., and Hofmarcher, P. (2025). Seeded poisson factorization: Leveraging domain knowledge to fit topic models. arXiv preprint arXiv:2503.02741.
Rana, M., Hacioglu, K., Gopalan, S., and Boothalingam,
M. (2024). Zero-shot slot filling in the age of llms for
dialogue systems. arXiv preprint arXiv:2411.18980.
Shen, D., Qin, C., Wang, C., Dong, Z., Zhu, H., and
Xiong, H. (2021). Topic modeling revisited: A doc-
ument graph-based neural network perspective. In
Advances in Neural Information Processing Systems,
volume 34, pages 14681–14693.
Srivastava, A. and Sutton, C. (2017). Autoencoding vari-
ational inference for topic models. arXiv preprint
arXiv:1703.01488.
Tang, Y. K., Huang, H., Shi, X., and Mao, X. L. (2024). Be-
yond labels and topics: Discovering causal relation-
ships in neural topic modeling. In Proceedings of the
ACM Web Conference 2024, pages 4460–4469.
Venugopalan, M. and Gupta, D. (2022). An enhanced
guided lda model augmented with bert based semantic
strength for aspect term extraction in sentiment analy-
sis. Knowledge-based Systems, 246:108668.
Wang, B., Li, J., Chang, H., Zhang, K., and Tsung, F.
(2025). Heterophilic graph neural networks optimiza-
tion with causal message-passing. In Proceedings of
the Eighteenth ACM International Conference on Web
Search and Data Mining, pages 829–837.
Wu, Y., McConnell, L., and Iriondo, C. (2024). Counterfac-
tual generative modeling with variational causal infer-
ence. arXiv preprint arXiv:2410.12730.
Yang, M., Liu, F., Chen, Z., Shen, X., Hao, J., and Wang, J.
(2021). Causalvae: Disentangled representation learn-
ing via neural structural causal models. In Proceed-
ings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition, pages 9593–9602.
Yang, Y., Nafea, M. S., Ghassami, A., and Kiyavash, N.
(2022). Causal discovery in linear structural causal
models with deterministic relations. In Conference
on Causal Learning and Reasoning, pages 944–993.
PMLR.
Yu, Y., Chen, J., Gao, T., and Yu, M. (2019). Dag-gnn: Dag
structure learning with graph neural networks. In In-
ternational Conference on Machine Learning, pages
7154–7163. PMLR.
Zečević, M., Dhami, D. S., Veličković, P., and Kersting, K. (2021). Relating graph neural networks to structural causal models. arXiv preprint arXiv:2109.04173.
Zheng, X., Aragam, B., Ravikumar, P. K., and Xing, E. P.
(2018). Dags with no tears: Continuous optimization
for structure learning. In Advances in Neural Informa-
tion Processing Systems, volume 31.
Zhu, B., Cai, Y., and Ren, H. (2023). Graph neural topic
model with commonsense knowledge. Information
Processing & Management, 60(2):103215.
Zhu, Q., Feng, Z., and Li, X. (2018). Graphbtm: Graph en-
hanced autoencoded variational inference for biterm
topic model. In Proceedings of the 2018 Conference
on Empirical Methods in Natural Language Process-
ing, pages 4663–4672.